Search Oracle Blogs

List of Blogs maintained by Paweł Barut.


Thursday, December 07, 2006

UTL_MATCH - String Similarity in Oracle

You can compare string using equality operator (=), or using similarity operator (LIKE). But there are cases where it is not enough. You Can use UTL_MATCH package to calculate string similarity index. This packages offers 4 functions that take two strings as parameters:
  • edit_distance - algorithm by Levenshtein - returns number of edits that must be done to change one string into second,
  • edit_distance_similarity - normalized results of edit_distance in percents - integer values,
  • jaro_winkler - returns similarity based on Jaro-Winkler distance algorithm,
  • jaro_winkler_similarity - same as above but presented as integer in range 0-100.
Lets teke a look how it works. First we can create sample table and insert some data:
create table countries(name varchar2(15) not null);
insert into countries values ('Poland');
insert into countries values ('Germany');
insert into countries values ('United States');
insert into countries values ('Portugal');
insert into countries values ('Czech Republic');
insert into countries values ('China');
insert into countries values ('Slovakia');
insert into countries values ('Slovenia');
commit;
Now we can take a look at results. Lets compare to miss-spelled country name: 'Slovnia'
select name
,to_char(utl_match.edit_distance(name, 'Slovnia'),'999') edit_dist
,to_char(utl_match.edit_distance_similarity(name, 'Slovnia'),'999') edit_dist_sim
,to_char(utl_match.jaro_winkler(name, 'Slovnia'),'999d9999') jaro_winkler
,to_char(utl_match.jaro_winkler_similarity(name, 'Slovnia'),'999') jaro_winkler_sim
from countries
order by jaro_winkler_sim desc;

NAME EDIT EDIT JARO_WINK JARO
-------------------- ---- ---- --------- ----
Slovenia 1 88 .9750 97
Slovakia 2 75 .8881 88
China 5 29 .5619 56
United States 12 8 .5531 55
Poland 6 15 .5317 53
Portugal 7 13 .5119 51
Germany 7 0 .3571 35
Czech Republic 13 8 .0000 0
Above we can observe differences in algorithms.
Lets test it on NULLs:
SQL> select to_char(utl_match.edit_distance('test', NULL),'999')
2 edit_dist
3 ,to_char(utl_match.edit_distance_similarity('test', NULL),'999')
4 edit_dist_sim
5 ,to_char(utl_match.jaro_winkler('test', NULL),'999d9999')
6 jaro_winkler
7 ,to_char(utl_match.jaro_winkler_similarity('test', NULL),'999')
8 jaro_winkler_sim
9 from dual;

EDIT EDIT JARO_WINK JARO
---- ---- --------- ----
-1 125 .0000 0
We can see that using edit_distance on NULLs migth be dengerous.
All samples were run on Oracle 10g XE

Hope You find it usefull.
Paweł

4 comments:

Stew said...

This is a very nice write-up of the comparison between the methods.

Where I work, they haven't upgrade our production systems to 10g R2 yet, but I need to implement the Jaro-Winkler calculation in my code NEXT WEEK. Have you seen implemented in Oracle anywhere? I could REALLY use it right now.

Thanks,

Stew

Paweł Barut said...

You can try to deploy java as stored procedure, and create wrapper for it. Some time ago I was experimenting with SecondString Project

Regards,
Paweł

Stew said...

Thanks for the suggestion, but I don't have a java implementation handy either. I found a VB version and converted it to PL/SQL, but it returns slightly different values from those you posted! Aagh!

If you care:

Slovenia .9583
Slovakia .8135
China .5619
United States .5531
Poland .5317
Portugal .5119
Germany .3571
Czech Republic .0000

I didn't write the VB version and have no clue what's wrong in it.

[sigh]

Anonymous said...

Thanks. Just what I needed. - Ed Kosciuszko

 

Copyright © Paweł Barut
Printing from DOS to USB Printer