java - Has anyone written a Fuzzy date match to catch data entry errors? -
i wrote routine in pl/sql try , match dates there might typographical/data entry errors.
it works, see if has other/better ideas. routine not need in pl/sql, read many languages.
function fuzzy_date_match(in_date_1 date, in_date_2 date) return number month_1 number(2); month_2 number(2); day_1 number(2); day_2 number(2); year_1 number(4); year_2 number(4); match_score number(3) := 0; begin if trunc(in_date_1) = trunc(in_date_2) match_score := 100; else if abs(trunc(in_date_1) - trunc(in_date_2)) < 2 match_score :=50; else month_1 := to_number(to_char(in_date_1,'mm')); month_2 := to_number(to_char(in_date_2,'mm')); if month_1 = month_2 match_score := match_score + 15; else if (abs(month_1 - month_2) < 2) or (to_number(substr(lpad(month_1,2,'0'),2,1)||substr(lpad(month_1,2,'0'),1,1)) = month_2) match_score := match_score + 7; end if; end if; day_1 := to_number(to_char(in_date_1,'dd')); day_2 := to_number(to_char(in_date_2,'dd')); if day_1 = day_2 match_score := match_score + 10; else if (abs(day_1 - day_2) < 2) or (to_number(substr(lpad(day_1,2,'0'),2,1)||substr(lpad(day_1,2,'0'),1,1)) = day_2) match_score := match_score + 5; end if; end if; year_1 := to_number(to_char(in_date_1,'yyyy')); year_2 := to_number(to_char(in_date_2,'yyyy')); if year_1 = year_2 match_score := match_score + 25; else if (abs(year_1 - year_2) < 2) or (to_number(substr(lpad(year_1,2,'0'),4,1)||substr(lpad(year_1,2,'0'),3,1)) = to_number(substr(to_char(year_2),3))) match_score := match_score + 12; end if; end if; end if; end if; return match_score; end fuzzy_date_match;
the basic concept compare 2 dates , return value between 0 , 100 100 exact match , 0 no match. types of errors looking single digit errors, , transposition errors. assumption years have more weight months, in turn have more weight days.
i tried googling fuzzy date matching, answers typically deal distance between dates opposed data entry errors.
aall appreciated.
paul
no need implement yourself. take @ utl_match package standard part of oracle. here's quick summary:
function edit_distance(s1 in varchar2, s2 in varchar2) return pls_integer; -- computes levenshtein distance between s1 , s2. function jaro_winkler(s1 in varchar2, s2 in varchar2) return binary_double; -- similar levenshtein distance, tries account mis-typings, -- character swaps, etc. function edit_distance_similarity(s1 in varchar2, s2 in varchar2) return pls_integer; -- similar levenshtein distance, returns integer 0 100 -- 0 means no similarity , 100 means strings identical. function jaro_winkler_similarity(s1 in varchar2, s2 in varchar2) return pls_integer; -- similar above, based on jaro-winkler.
here's quick example:
select utl_match.edit_distance('potato', 'tomato') lev, utl_match.edit_distance_similarity('potato', 'tomato') lev_sim, to_number(utl_match.jaro_winkler('potato', 'tomato')) jw, utl_match.jaro_winkler_similarity('potato', 'tomato') jw_sim dual;
sounds me might able use jaro_winkler_similarity. convert both dates standard string representation (e.g. to_char(adate, 'dd/mm/yyyy hh24:mi:ss')) , compare them.
(incidentally - to_number
applied results of jaro_winkler
because oracle throws ora-031115 : unsupported network datatype or representation
when jaro_winkler called because returns binary_double, oracle interface routines on windows platform can't seem deal with. why have type if can't use type? ??? :-)
share , enjoy.
Comments
Post a Comment