Monday, October 27, 2008

Split a MySQL text column into single words

Well I needed to split a column with text in it into its constituent words. Didn't do much clean up on the source. Removing multiple spaces and special characters in a sane way would be nice. I just need a qad solution though so this will have to suffice. Love to see anyone liberal use of REPLACE() to make this better.

Unfortunately not handling multiple space white space means lots of "blank" words. DISTINCT is our friend though. At the bottom I note that use of the INSERT...ON DUPLICATE UPDATE would be a cool way of counting occurrences as we build the table. Guess that is an exercise for another day.


--Step 1
--Create a table with sequence numbers. This Step can be skipped if you keep one around for this purpose
drop table if exists docsql_seq
go
--Could not find a way to create a table from a select with an auto_increment that
--did not have another column.
--The cartesian join on the information_schema.columns table should give us plenty of room to grow.
--change the LIMIT 100 if you have long blocks of text
CREATE TABLE docsql_seq
(
val INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (val)
)
SELECT
'1' AS Dummy
FROM
INFORMATION_SCHEMA.COLUMNS C1,
INFORMATION_SCHEMA.COLUMNS C2 LIMIT 100
go
--drop table is here for testing
drop table if exists docsql_word_xref
go
--With SUBSTRING_INDEX this was much easier than with the other DB
--the spaces CONCAT to the begining and end of the column handle columns with a single word
-- and ensure you do not miss the final word in the column
--Basically we will create the table on the fly with the select
create table docsql_word_xref
SELECT
distinct T1.Word
FROM
(SELECT
SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(' ',mt.Words,' '),' ',t.val+1),' ',-1) AS Word
FROM docsql_rt_raw mt,
docsql_seq t
WHERE t.val < (LENGTH(mt.Words) - LENGTH(REPLACE(mt.Words, ' ',''))) ) T1 WHERE T1.Word <> ''
go
--List out your words
select * from docsql_word_xref order by Word
go

--It would be interesting to use MySQL's INSERT OR UPDATE functionality to count occurances
--Cleaning up the text with word separators would be a nice addition.
--DocGyver

Followers