

Saturday, March 03, 2007

Yet another tokenizer in Oracle

I needed a function that transforms a string into a table of words, where the words can be separated by many different characters. For this purpose I've created the following function:
create or replace function tokenizer
  (p_string in varchar2
  ,p_separators in varchar2
  ) return dbms_sql.varchar2s
is
  v_strs dbms_sql.varchar2s;
begin
  with sel_string as 
      (select p_string fullstring 
         from dual)
  select substr(fullstring, beg+1, end_p-beg-1) token
   bulk collect into v_strs
   from (select beg, lead(beg) over (order by beg) end_p, fullstring
           from (select beg, fullstring
                   from (select level beg, fullstring
                           from sel_string
                         connect by level <= length(fullstring))
                   where instr(p_separators ,substr(fullstring,beg,1)) >0
                 union all
                 select 0, fullstring from sel_string
                 union all
                 select length(fullstring)+1, fullstring from sel_string))
    where end_p is not null
      and end_p>beg+1;
  return v_strs;
end;
Usage is very simple. The first parameter is the string to be tokenized; the second is a string of characters that should be treated as word separators. Example:
SQL> set serveroutput on
SQL> declare v_strs dbms_sql.varchar2s;
  2  begin
  3    v_strs := tokenizer('I was in need to have function that transforms string into table of words.'
  4        , ' ,:;.<>[]{}()');
  5    for i in v_strs.first..v_strs.last loop
  6      dbms_output.put_line(v_strs(i));
  7    end loop;
  8  end;
  9  /
I                                                                               
was                                                                             
in                                                                              
need                                                                            
to                                                                              
have                                                                            
function                                                                        
that                                                                            
transforms                                                                      
string                                                                          
into                                                                            
table                                                                           
of                                                                              
words                                                                           

PL/SQL procedure successfully completed.
Let me explain some elements:
  with sel_string as 
      (select p_string fullstring 
         from dual)
is there just to bind the string parameter into the SQL once; otherwise it would have to be bound in many places.
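To see the effect outside the function, here is a minimal sketch I've added (assuming a bind variable :p_string): the value is bound once in the WITH clause even though the factored subquery is referenced three times, just as in the UNION ALL branches above.
-- added sketch: :p_string is an assumed bind variable, bound only once
with sel_string as (select :p_string fullstring from dual)
select fullstring from sel_string
union all
select fullstring from sel_string
union all
select fullstring from sel_string;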
                 select beg, fullstring
                   from (select level beg, fullstring
                           from sel_string
                         connect by level <= length(fullstring))
                   where instr(p_separators ,substr(fullstring,beg,1)) >0
The innermost select generates numbers from 1 to the length of the string. The outer select keeps only those positions at which the character is one of the separators.
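On its own, that row generator looks like this (an added example of mine, not from the original post):
-- added example: one row per character position of 'ab,cd',
-- flagging the positions that hold a separator from the set ',.'
select level pos,
       substr('ab,cd', level, 1) ch,
       case when instr(',.', substr('ab,cd', level, 1)) > 0
            then 'separator' end flag
  from dual
connect by level <= length('ab,cd');
-- POS CH FLAG
--   1 a
--   2 b
--   3 ,  separator
--   4 c
--   5 d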
                 union all
                 select 0, fullstring from sel_string
                 union all
                 select length(fullstring)+1, fullstring from sel_string))
adds virtual separators at the beginning and at the end of the string, so the first and the last word can be recognized.
 select beg, lead(beg) over (order by beg) end_p
gives us the positions of two consecutive separators, and finally
  select substr(fullstring, beg+1, end_p-beg-1) token
   bulk collect into v_strs
   ...
    where end_p is not null
      and end_p>beg+1;
selects the words between them.
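To make the pairing concrete, here is a small hand-built sketch I've added (values assumed, not from the post) for the string 'ab,cd.ef' with separators ',.': the real separator positions are 3 and 6, plus the virtual ones at 0 and length+1 = 9.
-- added sketch: hard-coded separator positions for 'ab,cd.ef'
with positions as (
  select 0 beg from dual union all   -- virtual separator before the string
  select 3     from dual union all   -- the ','
  select 6     from dual union all   -- the '.'
  select 9     from dual             -- virtual separator after the string
)
select beg,
       lead(beg) over (order by beg) end_p,
       substr('ab,cd.ef', beg+1, lead(beg) over (order by beg) - beg - 1) token
  from positions;
-- BEG END_P TOKEN
--   0     3 ab
--   3     6 cd
--   6     9 ef
--   9 (null)       -- dropped by the "end_p is not null" filter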

Maybe you will find it useful,
Paweł

9 comments:

Anup said...

Genius dude!!!

Anup said...

Inspired by you, I created my version of the tokenizer:

FUNCTION Tokenizer (p_string     VARCHAR2
                   ,p_separators IN VARCHAR2)
RETURN dbms_sql.varchar2s IS
  l_token_tbl dbms_sql.varchar2s;
  pattern     varchar2(250);
BEGIN
  pattern := '[^(' || p_separators || ')]+';

  select regexp_substr(p_string, pattern, 1, level) token
    bulk collect into l_token_tbl
    from dual
   where regexp_substr(p_string, pattern, 1, level) is not null
  connect by regexp_instr(p_string, pattern, 1, level) > 0;

  RETURN l_token_tbl;
END Tokenizer;

Paweł Barut said...

Thanks Anup,
This is a really good solution, and more compact.

Anonymous said...

Suggest using "dbms_sql.varchar2a"
instead of "dbms_sql.varchar2s"...

type varchar2a is table of varchar2(32767) index by binary_integer;
-- bug 2410688: for users who require larger than varchar2(256),
-- this type has been introduced together with parse overloads
-- that take this type.

type varchar2s is table of varchar2(256) index by binary_integer;
-- Note that with the introduction of varchar2a we will deprecate
-- this type, will phase out over a number of releases.
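To illustrate the difference (a quick anonymous block I've added, not part of the comment): varchar2a holds tokens of up to 32767 characters, while varchar2s tops out at 256.
declare
  v_small dbms_sql.varchar2s;  -- index-by table of varchar2(256)
  v_big   dbms_sql.varchar2a;  -- index-by table of varchar2(32767)
begin
  v_big(1) := rpad('x', 1000, 'x');         -- fits without a problem
  -- v_small(1) := rpad('x', 1000, 'x');    -- would raise ORA-06502: character string buffer too small
  dbms_output.put_line(length(v_big(1)));   -- prints 1000
end;
/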

Paweł Barut said...

Yes, you are right. It's better to use dbms_sql.varchar2a.

Anonymous said...

there's a problem with the second tokenizer function: it throws an error when the separators include []
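One possible workaround (my own sketch, not from the thread): in a POSIX bracket expression a literal ] is only treated as an ordinary character when it comes first, so the pattern can be built with ] moved to the front of the negated class, for example:
declare
  p_separators varchar2(50) := ' ,:;.<>[]{}()';
  pattern      varchar2(250);
begin
  -- assumed sketch: put ']' right after '[^' and drop it from the rest of the
  -- list (a '-' would similarly have to go last)
  pattern := '[^' ||
             case when instr(p_separators, ']') > 0 then ']' end ||
             replace(p_separators, ']') ||
             ']+';
  dbms_output.put_line(pattern);                                  -- prints [^] ,:;.<>[{}()]+
  dbms_output.put_line(regexp_substr('a[b]c d', pattern, 1, 2));  -- prints b
end;
/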

Bino Mathew said...

/* I too got inspired to write a tokenizer which will return the
   number of tokens in a string. */
CREATE OR REPLACE FUNCTION F_GET_NO_OF_TOKENS (
  in_string IN VARCHAR2
)
RETURN NUMBER
as
  num     NUMBER;
  istart  number;
  ipos    number;
  tmp_str VARCHAR2(300);
  str_len number;
BEGIN
  num     := 0;
  istart  := 1;
  tmp_str := Rtrim(Ltrim(in_string));

  while (1=1)
  loop
    ipos    := Instr(tmp_str, ' ', istart);
    str_len := LENGTH(tmp_str);

    IF (ipos = 0) AND (str_len > 1) THEN
      num := num + 1;
    END IF;

    exit when (ipos = 0);

    IF ipos <> 0 THEN
      num := num + 1;
      tmp_str := Rtrim(Ltrim(Substr(tmp_str, ipos+1)));
    END IF;
  end loop;

  RETURN num;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    NULL;
  WHEN OTHERS THEN
    RAISE;
END F_GET_NO_OF_TOKENS;
/

Paweł Barut said...

Hi Bino,

The tokenizer you provided checks only for spaces. In that case I would rather use this code to achieve the same goal. Less code = fewer chances for error.

CREATE OR REPLACE FUNCTION F_GET_NO_OF_TOKENS (
  in_string IN VARCHAR2
) RETURN NUMBER
is
  t_str varchar2(4000);
  l     number;
BEGIN
  t_str := trim(in_string);
  l := length(t_str);
  while l > length(replace(t_str, '  ', ' ')) loop -- Any 2 spaces replaced by 1
    t_str := replace(t_str, '  ', ' ');
    l := length(t_str);
  end loop;
  return l - length(replace(t_str, ' ', '')) + 1; -- Remove spaces
END F_GET_NO_OF_TOKENS;
/


BR/Pawel
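
A quick check of that counting function (an added example, not part of the original comment):
-- added example (expected result: 7)
select F_GET_NO_OF_TOKENS('I was in need to have function') cnt
  from dual;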

Anonymous said...

Use this for tokenizing:
SELECT LEVEL
, SUBSTR
( STRING_TO_TOKENIZE
, DECODE(LEVEL, 1, 1, INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL-1)+1)
, INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL) -
DECODE(LEVEL, 1, 1, INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL-1)+1)
)
FROM ( SELECT '&String_To_Tokenize'||'&Delimiter' AS STRING_TO_TOKENIZE
, '&Delimiter' AS DELIMITER
FROM DUAL
)
CONNECT BY INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL)>0
ORDER BY LEVEL ASC
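
With concrete values substituted for the &String_To_Tokenize and &Delimiter prompts (an added example, not from the comment), say 'red,green,blue' and ',':
SELECT LEVEL
     , SUBSTR( STRING_TO_TOKENIZE
             , DECODE(LEVEL, 1, 1, INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL-1)+1)
             , INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL)
               - DECODE(LEVEL, 1, 1, INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL-1)+1)
             ) AS TOKEN
  FROM ( SELECT 'red,green,blue'||',' AS STRING_TO_TOKENIZE
              , ',' AS DELIMITER
           FROM DUAL )
CONNECT BY INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL) > 0
 ORDER BY LEVEL ASC;
-- LEVEL TOKEN
--     1 red
--     2 green
--     3 blue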

 

Copyright © Paweł Barut