Compound Words

Could this be the greatest script in the world... No, this is just a tribute.

Compound words such as winemaking, cookbook and hitchhikers are a problem in full-text searches. For example, when searching for "Hitchhikers guide to the galaxy", your query would contain the condition WHERE MATCH(title) AGAINST ('+hitchhikers +guide +to +the +galaxy' IN BOOLEAN MODE). That would be fine, unless you happen to know, as I do, that some editions of this book write Hitchhikers as hitch hikers (with a space in between!). This script shows how to create a query that includes both spellings, i.e. Hitchhikers and Hitch Hikers, using the pspell dictionary.

To learn more about MySQL full-text search, see MySQL Fulltext. For the use of parentheses and operator syntax in the SQL query, see MySQL Boolean Full-Text Searches.
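
As a minimal sketch of the idea described above, the joined and the split spelling can be combined into one required group of the boolean expression. The helper names compound_clause() and boolean_expression() are hypothetical and exist only for this illustration; the compound and its parts are hard-coded here rather than looked up with pspell.

```php
<?php
// Hypothetical helper: build one required boolean-mode group that matches
// either the joined spelling or all of its parts.
function compound_clause(string $compound, array $parts): string
{
    return '+(' . $compound . ' (+' . implode(' +', $parts) . '))';
}

// Hypothetical helper: assemble the full expression passed to AGAINST().
function boolean_expression(array $terms): string
{
    $clauses = array();
    foreach ($terms as $term) {
        // An array entry is array(compound, parts); a string is a plain word.
        $clauses[] = is_array($term)
            ? compound_clause($term[0], $term[1])
            : '+' . $term;
    }
    return implode(' ', $clauses);
}

$expr = boolean_expression(array(
    array('hitchhikers', array('hitch', 'hikers')),
    'guide',
    'galaxy',
));
echo $expr, "\n"; // +(hitchhikers (+hitch +hikers)) +guide +galaxy
```

The resulting string is what would go inside AGAINST ('…' IN BOOLEAN MODE); passing it as a bound parameter rather than concatenating it into the SQL is the safer choice.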


[CODE=CompoundWords.php]
<?php
// ***********************************************************
// Using the pspell dictionary to break up compound words
// ***********************************************************/
$int = pspell_new("en", "", "", "", (PSPELL_FAST|PSPELL_RUN_TOGETHER));

// Example search input strings // SQL results will be synonymous
// (stop words such as "to", "the" and "and" are dropped by f())
// ***********************************************************/
print f("Hitchhikers guide to the galaxy");
// SQL: '+(hitchhiker (+hitch +hiker)) +guide +galaxy'
print f("Hitch hikers guide to the galaxy");
// SQL: '+((+hitch +hiker) hitchhiker) +guide +galaxy'

print f("wine making and stuff");
// SQL: '+((+wine +making) winemaking) +stuff'
print f("winemaking and stuff");
// SQL: '+(winemaking (+wine +making)) +stuff'

print f("cook book"); // SQL: '+(cookbook (+cook +book))'
print f("cookbook"); // SQL: '+((+cook +book) cookbook)'

print "Handling spelling mistakes";
print f("harrypoter harrypotter"); // SQL: depends on the suggestions pspell returns for these misspellings

//////////////////////////////////////////////////////////////////////////

function f($s){
    global $int;

    // replace punctuation and trailing "'s"/"s" with spaces, then lowercase
    $s = strtolower(rtrim(str_replace(array("'s ", 's ','<','>',',','.','?','/',':',';','@',"'","'",'~','#','{','[','}', ']','|','!','"','"','£','$','%','^','&','*','(',')','_','+','-','='), ' ', $s.' '), ' '));

    $e = array('the','a','and','of','to'); // stop words
    for($i=0;$i<count($e);$i++){$e[$i] = ' '.$e[$i].' ';}

    $s = str_replace($e, ' ', $s); // remove stop words
    $a = preg_split('/\W+/', $s, -1, PREG_SPLIT_NO_EMPTY); // create an array of words, skipping empty strings
    $m = array(); // maps each word to its alternative forms

    for($i=0;$i<count($a);$i++){ //loop through
        // find matches for words, by rearranging and compounding words in the search
        // for example "wine making" -> "winemaking"
        if($i<count($a)-1 && pspell_check($int, $a[$i].$a[$i+1])){ // if not the last word && creates a compound word with its next neighbour
            $m[$a[$i].$a[$i+1]] = array($a[$i]=>split_word($a[$i]), $a[$i+1]=>split_word($a[$i+1]));
            $i++; // jump to word after
        }
        else {
            $m[$a[$i]] = split_word($a[$i]);
        }
    }
    // Uncomment the next line to see the structure built by the loop above
    // print_r($m);

    // Call the function 'h' to create the SQL
    $r = h($m);

    // Return the SQL, followed by a newline
    return $r."\n";
}


function h($a){
    $r = array();
    foreach($a as $k => $v){
        if(count($v)>0){
            $r[] = '('.$k.' ('.h($v).'))';
        } else {
            $r[] = ''.$k.'';
        }
    }
    return ' +'.implode(' +', $r);
}


function split_word($s){
    // include pspell object
    global $int;

    // find spelling suggestions
    $c = pspell_suggest($int, $s);
    // loop through alternatives
    $t = array();
    for($j=0;$j<count($c);$j++){
        // Full-text search is not case-sensitive, so lowercase the suggestion before comparing with ==.
        $c[$j]=strtolower($c[$j]);
        // CHECK: is the alternative form a split version of the original?
        if($s != $c[$j] && str_replace(' ','', $c[$j]) == $s && strlen($s) > 6 ){
            $a = preg_split('/[\W]+?/',$c[$j]);    // create an array of words
            for($i=0;$i<count($a);$i++){
                $t[$a[$i]] = split_word($a[$i]);
            }
        }
        // ELSE: if this is the first suggestion and it differs from the word, treat it as an alternate spelling
        else if($j == 0 && $c[$j] != $s){
            $t[$c[$j]] = array();
        }
    }
    return $t;
}
?>
[/CODE]
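
The first half of f() (normalising the input and joining neighbouring words) can be tried without the pspell extension. The sketch below is an approximation under stated assumptions: tokenize() and join_neighbours() are hypothetical names, stop words are removed per token with array_diff() instead of the string replacement used above, plural and possessive endings are not stripped, and a tiny hard-coded word list stands in for pspell_check().

```php
<?php
// Split a search string into lowercase words and drop stop words.
function tokenize(string $input, array $stopWords = array('the', 'a', 'and', 'of', 'to')): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_diff($words, $stopWords));
}

// Join each word with its right-hand neighbour when the dictionary accepts
// the joined form. $isWord is a stand-in for pspell_check().
function join_neighbours(array $words, callable $isWord): array
{
    $out = array();
    for ($i = 0; $i < count($words); $i++) {
        $joined = isset($words[$i + 1]) ? $words[$i] . $words[$i + 1] : null;
        if ($joined !== null && $isWord($joined)) {
            $out[] = array('compound' => $joined, 'parts' => array($words[$i], $words[$i + 1]));
            $i++; // skip the neighbour that was consumed
        } else {
            $out[] = array('compound' => $words[$i], 'parts' => array());
        }
    }
    return $out;
}

// Tiny hypothetical dictionary used only for this demonstration.
$dictionary = array('winemaking', 'cookbook');
$isWord = function (string $w) use ($dictionary): bool {
    return in_array($w, $dictionary, true);
};

$terms = join_neighbours(tokenize('Wine making, and stuff!'), $isWord);
print_r($terms); // "winemaking" with parts wine/making, then "stuff" with no parts
```

Passing the dictionary check in as a callable keeps the joining logic testable on machines where pspell is not installed.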

Limitations

This does not support multiple suggestions from pspell. For example, for the word Houseplants pspell returns both ho useplants and house plants, which the current script merges into a single group in the SQL statement:
+(Houseplant (+ho +(useplant (+use +plant)) +house +plant))
The SQL query really needs to keep each alternative split in its own group, something like
+(Houseplant ((+ho +(useplant (+use +plant))) (+house +plant)))
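
One possible way to produce the grouped form is to store every alternative split as its own list and wrap each list in parentheses when rendering. This is only a sketch: the node() and render() helpers and the array shape are hypothetical and not part of the script above, and the alternatives for houseplant are hard-coded instead of being taken from pspell_suggest().

```php
<?php
// Hypothetical node shape: array('word' => string, 'alts' => array of alternatives),
// where each alternative is a list of nodes that must all match.
function node(string $word, array $alts = array()): array
{
    return array('word' => $word, 'alts' => $alts);
}

// Render a node as a boolean-mode fragment, one parenthesised group per alternative.
function render(array $node): string
{
    if (count($node['alts']) === 0) {
        return $node['word'];
    }
    $groups = array();
    foreach ($node['alts'] as $alt) {
        $groups[] = '(+' . implode(' +', array_map('render', $alt)) . ')';
    }
    // A single alternative needs no extra wrapper; several are grouped together.
    $inner = count($groups) === 1 ? $groups[0] : '(' . implode(' ', $groups) . ')';
    return '(' . $node['word'] . ' ' . $inner . ')';
}

$houseplant = node('houseplant', array(
    array(node('ho'), node('useplant', array(array(node('use'), node('plant'))))),
    array(node('house'), node('plant')),
));
echo '+' . render($houseplant), "\n";
// +(houseplant ((+ho +(useplant (+use +plant))) (+house +plant)))
```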

Author

Andrew Dodson
Since:Feb 2007

Comment | flag

Categories

Bookmark and Share