Unfortunately not handling multiple space white space means lots of "blank" words. DISTINCT is our friend though. At the bottom I note that use of the INSERT...ON DUPLICATE UPDATE would be a cool way of counting occurrences as we build the table. Guess that is an exercise for another day.
--Step 1
--Create a table with sequence numbers. This Step can be skipped if you keep one around for this purpose
drop table if exists docsql_seq
go
--Could not find a way to create a table from a select with an auto_increment that
--did not have another column.
--The cartesian join on the information_schema.columns table should give us plenty of room to grow.
--change the LIMIT 100 if you have long blocks of text
CREATE TABLE docsql_seq
(
val INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (val)
)
SELECT
'1' AS Dummy
FROM
INFORMATION_SCHEMA.COLUMNS C1,
INFORMATION_SCHEMA.COLUMNS C2 LIMIT 100
go
--drop table is here for testing
drop table if exists docsql_word_xref
go
--With SUBSTRING_INDEX this was much easier than with the other DB
--the spaces CONCAT to the begining and end of the column handle columns with a single word
-- and ensure you do not miss the final word in the column
--Basically we will create the table on the fly with the select
create table docsql_word_xref
SELECT
distinct T1.Word
FROM
(SELECT
SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(' ',mt.Words,' '),' ',t.val+1),' ',-1) AS Word
FROM docsql_rt_raw mt,
docsql_seq t
WHERE t.val < (LENGTH(mt.Words) - LENGTH(REPLACE(mt.Words, ' ',''))) ) T1 WHERE T1.Word <> ''
go
--List out your words
select * from docsql_word_xref order by Word
go
--It would be interesting to use MySQL's INSERT OR UPDATE functionality to count occurances
--Cleaning up the text with word separators would be a nice addition.
--DocGyver
No comments:
Post a Comment