create or replace function tokenizer
(p_string in varchar2
,p_separators in varchar2
) return dbms_sql.varchar2s
is
v_strs dbms_sql.varchar2s;
begin
with sel_string as
(select p_string fullstring
from dual)
select substr(fullstring, beg+1, end_p-beg-1) token
bulk collect into v_strs
from (select beg, lead(beg) over (order by beg) end_p, fullstring
from (select beg, fullstring
from (select level beg, fullstring
from sel_string
connect by level <= length(fullstring))
where instr(p_separators ,substr(fullstring,beg,1)) >0
union all
select 0, fullstring from sel_string
union all
select length(fullstring)+1, fullstring from sel_string))
where end_p is not null
and end_p>beg+1;
return v_strs;
end;
Usage is very simple. First parameter is string that should be tokenized, second parameter is string with characters that should be treated as word separators. Example:
SQL> set serveroutput on
SQL> declare v_strs dbms_sql.varchar2s;
2 begin
3 v_strs := tokenizer('I was in need to have function that transforms string into table of words.'
4 , ' ,:;.<>[]{}()');
5 for i in v_strs.first..v_strs.last loop
6 dbms_output.put_line(v_strs(i));
7 end loop;
8 end;
9 /
I
was
in
need
to
have
function
that
transforms
string
into
table
of
words
PL/SQL procedure successfully completed.
Let me explain some elements
with sel_string as
(select p_string fullstring
from dual)
is just to bind string parameter into SQL just once. Otherwise it will have to be binded in many places.
select beg, fullstring
from (select level beg, fullstring
from sel_string
connect by level <= length(fullstring))
where instr(p_separators ,substr(fullstring,beg,1)) >0
Internal select generates numbers from 1 to length of string. Outer select chooses only these rows (numbers) that are separators in string.
union all
select 0, fullstring from sel_string
union all
select length(fullstring)+1, fullstring from sel_string))
adds separator at beginning and end of string, so first and last work can be recognized.
select beg, lead(beg) over (order by beg) end_pgives as positions of two consecutive separators, and finally
select substr(fullstring, beg+1, end_p-beg-1) token
bulk collect into v_strs
...
where end_p is not null
and end_p>beg+1;
select words.
Might be you find it useful,
Paweł



5 comments:
Genius dude!!!
Inspired by you, i created my version of tokenizer:
FUNCTION Tokenizer (p_string VARCHAR2
,p_separators in VARCHAR2)
RETURN dbms_sql.varchar2s IS
l_token_tbl dbms_sql.varchar2s;
pattern varchar2(250);
BEGIN
pattern := '[^(' || p_separators || ')]+' ;
select regexp_substr(p_string, pattern,1,level) token
bulk collect into l_token_tbl
from dual
where regexp_substr(p_string, pattern,1,level) is not null
connect by regexp_instr(p_string, pattern,1, level ) > 0;
RETURN l_token_tbl;
END Tokenizer;
Thanks Anup,
This is really good solution. And more compact.
Suggest using "dbms_sql.varchar2a"
instead of "dbms_sql.varchar2s"...
type varchar2a is table of varchar2(32767) index by binary_integer;
-- bug 2410688: for users who require larger than varchar2(256),
-- this type has been introduced together with parse overloads
-- that take this type.
type varchar2s is table of varchar2(256) index by binary_integer;
-- Note that with the introduction of varchar2a we will deprecate
-- this type, will phase out over a number of releases.
Yes, you are right. It's better to use dbms_sql.varchar2a
Post a Comment