create or replace function tokenizer (p_string     in varchar2
                                     ,p_separators in varchar2)
return dbms_sql.varchar2s is
  v_strs dbms_sql.varchar2s;
begin
  with sel_string as (select p_string fullstring from dual)
  select substr(fullstring, beg+1, end_p-beg-1) token
  bulk collect into v_strs
  from (select beg, lead(beg) over (order by beg) end_p, fullstring
        from (select beg, fullstring
              from (select level beg, fullstring
                    from sel_string
                    connect by level <= length(fullstring))
              where instr(p_separators, substr(fullstring, beg, 1)) > 0
              union all
              select 0, fullstring from sel_string
              union all
              select length(fullstring)+1, fullstring from sel_string))
  where end_p is not null and end_p > beg+1;
  return v_strs;
end;

Usage is very simple. The first parameter is the string to be tokenized; the second is a string of characters that should be treated as word separators. Example:
SQL> set serveroutput on
SQL> declare v_strs dbms_sql.varchar2s;
  2  begin
  3    v_strs := tokenizer('I was in need to have function that transforms string into table of words.'
  4                       , ' ,:;.<>[]{}()');
  5    for i in v_strs.first..v_strs.last loop
  6      dbms_output.put_line(v_strs(i));
  7    end loop;
  8  end;
  9  /
I
was
in
need
to
have
function
that
transforms
string
into
table
of
words

PL/SQL procedure successfully completed.

Let me explain some of the elements.
with sel_string as (select p_string fullstring from dual)

is there just to bind the string parameter into the SQL once; otherwise it would have to be bound in several places.
select beg, fullstring
from (select level beg, fullstring
      from sel_string
      connect by level <= length(fullstring))
where instr(p_separators, substr(fullstring, beg, 1)) > 0

The inner select generates numbers from 1 to the length of the string. The outer select keeps only those rows (positions) at which the character is a separator.
union all
select 0, fullstring from sel_string
union all
select length(fullstring)+1, fullstring from sel_string))

adds a separator at the beginning and at the end of the string, so the first and last words can be recognized.
select beg, lead(beg) over (order by beg) end_p

gives us the positions of two consecutive separators, and finally
select substr(fullstring, beg+1, end_p-beg-1) token bulk collect into v_strs ... where end_p is not null and end_p > beg+1;

selects the words.
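Putting those pieces together, here is a self-contained sketch of the same query with the string and the separators inlined as literals (the example values are mine), so it can be tried without creating the function:

```sql
with sel_string as (select 'one,two;three' fullstring from dual)
select substr(fullstring, beg+1, end_p-beg-1) token
from (select beg, lead(beg) over (order by beg) end_p, fullstring
      from (select beg, fullstring
            from (select level beg, fullstring
                  from sel_string
                  connect by level <= length(fullstring))
            -- keep only the positions holding a separator character
            where instr(',;', substr(fullstring, beg, 1)) > 0
            union all
            select 0, fullstring from sel_string                    -- virtual separator before the string
            union all
            select length(fullstring)+1, fullstring from sel_string -- virtual separator after the string
            ))
where end_p is not null and end_p > beg+1;
-- returns three rows: one, two, three
```

The separator positions here are 0, 4, 8 and 14, and each pair of consecutive positions delimits one token.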
Maybe you will find it useful,
Paweł
9 comments:
Genius, dude!!!
Inspired by you, I created my own version of the tokenizer:
FUNCTION Tokenizer (p_string VARCHAR2
,p_separators in VARCHAR2)
RETURN dbms_sql.varchar2s IS
l_token_tbl dbms_sql.varchar2s;
pattern varchar2(250);
BEGIN
pattern := '[^(' || p_separators || ')]+' ;
select regexp_substr(p_string, pattern,1,level) token
bulk collect into l_token_tbl
from dual
where regexp_substr(p_string, pattern,1,level) is not null
connect by regexp_instr(p_string, pattern,1, level ) > 0;
RETURN l_token_tbl;
END Tokenizer;
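Assuming this regexp version is created under the same name, it can be exercised exactly like the original (example values are mine):

```sql
-- anonymous block; requires SET SERVEROUTPUT ON in SQL*Plus
declare
  v_strs dbms_sql.varchar2s;
begin
  v_strs := tokenizer('red,green;blue', ',;');  -- pattern becomes '[^(,;)]+'
  for i in v_strs.first .. v_strs.last loop
    dbms_output.put_line(v_strs(i));            -- prints: red, green, blue
  end loop;
end;
/
```

Note that REGEXP_SUBSTR returns each maximal run of non-separator characters, so empty tokens between adjacent separators simply disappear, unlike in the CONNECT BY version.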
Thanks Anup,
This is a really good solution, and more compact.
Suggest using "dbms_sql.varchar2a"
instead of "dbms_sql.varchar2s"...
type varchar2a is table of varchar2(32767) index by binary_integer;
-- bug 2410688: for users who require larger than varchar2(256),
-- this type has been introduced together with parse overloads
-- that take this type.
type varchar2s is table of varchar2(256) index by binary_integer;
-- Note that with the introduction of varchar2a we will deprecate
-- this type, will phase out over a number of releases.
Yes, you are right. It's better to use dbms_sql.varchar2a.
There's a problem with the second tokenizer function: it throws an error when the separators include [].
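Right: the pattern is built as '[^(' || p_separators || ')]+', so a ']' inside p_separators closes the bracket expression early (and, as a side effect, '(' and ')' are always treated as separators). A hedged sketch of a fix, relying on the POSIX rule that a literal ']' is valid only as the first character of a bracket expression; this is my suggestion, not part of the original function, and only lightly tested:

```sql
-- replacement for the pattern-building line in the regexp tokenizer
if instr(p_separators, ']') > 0 then
  -- force ']' to the front of the class and drop the stray parentheses
  pattern := '[^]' || replace(p_separators, ']') || ']+';
else
  pattern := '[^' || p_separators || ']+';
end if;
```

With this change, Tokenizer('a]b]c', ']') would split on ']' instead of raising a regular-expression error.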
/* I too got inspired to write a tokenizer, which returns the number of tokens in a string. */
CREATE OR REPLACE FUNCTION F_GET_NO_OF_TOKENS (
in_string IN VARCHAR2
)
RETURN NUMBER
as
num NUMBER;
istart number;
ipos number;
tmp_str VARCHAR2(300);
str_len number;
BEGIN
num := 0 ;
istart := 1;
tmp_str := Rtrim(Ltrim(in_string));
if tmp_str is null then
return 0; -- guard: Instr on a NULL string returns NULL and the loop below would never exit
end if;
while (1=1)
loop
ipos := Instr (tmp_str ,' ',istart);
str_len := LENGTH(tmp_str);
IF ( ipos = 0) AND ( str_len >= 1) THEN -- >= 1, so a single-character last token is counted too
num:= num +1;
END IF;
exit when ( ipos =0);
IF ipos <> 0 THEN
num:= num +1;
tmp_str := Rtrim(Ltrim(Substr(tmp_str ,ipos+1)));
END IF;
end loop;
RETURN num;
EXCEPTION
WHEN NO_DATA_FOUND THEN
NULL;
WHEN OTHERS THEN
RAISE;
END F_GET_NO_OF_TOKENS;
/
Hi Bino,
The tokenizer you provided checks only for spaces. In that case I would rather use this code to achieve the same goal. Less code = fewer chances for error.
CREATE OR REPLACE FUNCTION F_GET_NO_OF_TOKENS (
in_string IN VARCHAR2
) RETURN NUMBER
is
t_str varchar2(4000);
l number;
BEGIN
t_str := trim(in_string);
l := length(t_str);
while l > length(replace(t_str, '  ', ' ')) loop -- Any 2 spaces replaced by 1
t_str := replace(t_str, '  ', ' ');
l := length(t_str);
end loop;
return l - length(replace(t_str, ' ', ''))+1; -- Count the remaining single spaces; tokens = spaces + 1
END F_GET_NO_OF_TOKENS;
/
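A quick check of the compact version (my example; it assumes the double spaces inside the replace calls survived the blog formatting, which tends to collapse them):

```sql
select f_get_no_of_tokens('  one   two  three ') as no_of_tokens
from dual;
-- 3: the loop first collapses runs of spaces, so repeated
-- separators do not inflate the count
```

The design trade-off is that this version handles only the space separator, but needs no explicit scanning loop over token positions.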
BR/Pawel
Use this for tokens:
SELECT LEVEL
, SUBSTR
( STRING_TO_TOKENIZE
, DECODE(LEVEL, 1, 1, INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL-1)+1)
, INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL) -
DECODE(LEVEL, 1, 1, INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL-1)+1)
)
FROM ( SELECT '&String_To_Tokenize'||'&Delimiter' AS STRING_TO_TOKENIZE
, '&Delimiter' AS DELIMITER
FROM DUAL
)
CONNECT BY INSTR(STRING_TO_TOKENIZE, DELIMITER, 1, LEVEL)>0
ORDER BY LEVEL ASC
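The same idea with literals in place of the substitution variables (my rewrite, for illustration); appending one extra delimiter to the string lets the last token be delimited like all the others:

```sql
SELECT LEVEL
     , SUBSTR(str
             , DECODE(LEVEL, 1, 1, INSTR(str, d, 1, LEVEL-1)+1)     -- start: just after the previous delimiter
             , INSTR(str, d, 1, LEVEL)
               - DECODE(LEVEL, 1, 1, INSTR(str, d, 1, LEVEL-1)+1)   -- length: up to the next delimiter
             ) AS token
FROM (SELECT 'a,b,c' || ',' AS str, ',' AS d FROM DUAL)
CONNECT BY INSTR(str, d, 1, LEVEL) > 0
ORDER BY LEVEL ASC;
-- rows: a, b, c
```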