
Option to save and load trained model (some workaround suggested) #2

Open
sekarpdkt opened this issue Apr 16, 2018 · 5 comments
Labels
enhancement New feature or request

Comments

@sekarpdkt
Contributor

Hi

I tried to save the trained dictionary and reload it, but it is not working. Do you have any idea how to do it? Here is what I tried. To save the trained dictionary:

import json

# Collect the trained internals of the SymSpell instance `ss`.
myData = dict()
myData["_words"] = ss._words
myData["_deletes"] = ss._deletes
myData["_below_threshold_words"] = ss._below_threshold_words
myData["_max_length"] = ss._max_length
myData["_distance_algorithm"] = ss._distance_algorithm
myData["_max_dictionary_edit_distance"] = ss._max_dictionary_edit_distance
myData["_prefix_length"] = ss._prefix_length
myData["_count_threshold"] = ss._count_threshold
myData["_compact_mask"] = ss._compact_mask

filename = 'SymSpell_dictionary.json'
print('Saving dictionary...')
with open(filename, 'w', encoding='ISO-8859-1') as fp:
    json.dump(myData, fp)
print('Saved dictionary...')

Once saved, I tried to reload it like this:

myData = dict()
print('Loading dictionary...')
filename = 'SymSpell_dictionary.json'

with open(filename, 'r', encoding='ISO-8859-1') as fp:
    myData = json.load(fp)
print('Loaded dictionary...')

ss._words = myData["_words"]
ss._deletes = myData["_deletes"]
ss._below_threshold_words = myData["_below_threshold_words"]
ss._max_length = myData["_max_length"]
ss._distance_algorithm = myData["_distance_algorithm"]
ss._max_dictionary_edit_distance = myData["_max_dictionary_edit_distance"]
ss._prefix_length = myData["_prefix_length"]
ss._count_threshold = myData["_count_threshold"]
ss._compact_mask = myData["_compact_mask"]

It is not working: it loads without errors, but spell correction does not work afterwards.

As a workaround, I added the following two functions to the main file, which are working:

    def save_words_with_freq_as_json(self, filename, encoding="utf8"):
        # Dump only the word -> frequency dictionary; everything else
        # can be rebuilt from it. Requires `import json` in this file.
        print('Saving dictionary...')
        with open(filename, 'w', encoding=encoding) as fp:
            json.dump(self._words, fp)
        print('Saved dictionary...')

    def load_words_with_freq_from_json_and_build_dictionary(self, filename, encoding="utf8"):
        # Read the word -> frequency dictionary back and rebuild the
        # internal structures (including _deletes) entry by entry.
        print('Loading dictionary...')
        with open(filename, 'r', encoding=encoding) as fp:
            myData = json.load(fp)
        for word in myData:
            self._create_dictionary_entry(word, myData[word])
        print('Loaded dictionary...')

To use it, you can save like this:

filename = 'SymSpell_Dictionary_Word.json'
ss.save_words_with_freq_as_json(filename, encoding='ISO-8859-1')

and load like this:

ss = SymSpell(max_dictionary_edit_distance=3)
filename = 'SymSpell_Dictionary_Word.json'
ss.load_words_with_freq_from_json_and_build_dictionary(filename, encoding='ISO-8859-1')

The above works, if anyone is interested. But if we could save and load _deletes, _words, etc., it would be faster than training every time.
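
One more idea I have not tested with this code base: pickle preserves Python types, including int dictionary keys, so the trained state could in principle be dumped in one go. A rough sketch (the attribute names are the ones used above; everything else is an assumption):

import pickle

# Sketch (untested): pickle keeps int keys intact, so the trained
# internals should survive a round trip unchanged.
with open('SymSpell_state.pickle', 'wb') as fp:
    pickle.dump((ss._words, ss._deletes), fp)

with open('SymSpell_state.pickle', 'rb') as fp:
    ss._words, ss._deletes = pickle.load(fp)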

@ne3x7
Owner

ne3x7 commented Apr 16, 2018

Thanks for your suggestion.

There is a simple load_dictionary() method that reads whitespace-separated word-count pairs from a file. Creating this file is not implemented, but as you correctly noticed, it basically amounts to dumping the _words dictionary.
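
Producing such a file could look roughly like this (just a sketch with your ss instance from above; it assumes _words maps each word to its frequency count):

# Sketch: write ss._words as whitespace-separated word-count pairs,
# the format load_dictionary() reads back.
with open('frequency_dictionary.txt', 'w', encoding='utf8') as fp:
    for word, count in ss._words.items():
        fp.write('%s %d\n' % (word, count))

Calling load_dictionary() on the resulting file then rebuilds _deletes as part of loading.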

Saving and loading the _deletes dictionary was not intended in the original SymSpell, because it is well optimized to build quickly. I have not measured the speed yet, but building _deletes via load_dictionary() seems fast enough.

@ne3x7 ne3x7 added the enhancement New feature or request label Apr 16, 2018
@sekarpdkt
Contributor Author

sekarpdkt commented Apr 17, 2018

It was a simple issue. When we load JSON from a file, the keys are stored as strings, whereas the _deletes keys are ints (hash values). We need to do something like:

for hs in sorted(myDeleteData):
    ss._deletes[int(hs)] = myDeleteData[hs]

and the good news is that it works. The key here is that you need to convert hs to an int ([int(hs)]) while transferring it to ss._deletes. I will be sending a pull request. I also implemented multithreading for creating the hash table in Python: with four threads, loading 500K words took 3 minutes on my local machine, instead of 7 minutes without multithreading.
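
For completeness, the full round trip of _deletes then looks roughly like this (a sketch based on the fix above; it assumes the _deletes values are JSON-serializable lists, so only the keys need converting back):

import json

# Save: json.dump turns the int keys of _deletes into strings.
with open('SymSpell_deletes.json', 'w', encoding='ISO-8859-1') as fp:
    json.dump(ss._deletes, fp)

# Load: convert each key back to int while rebuilding _deletes,
# otherwise lookups by hash value miss every entry.
with open('SymSpell_deletes.json', 'r', encoding='ISO-8859-1') as fp:
    myDeleteData = json.load(fp)
ss._deletes = {int(hs): myDeleteData[hs] for hs in myDeleteData}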

@sekarpdkt
Contributor Author

Raised a pull request

@ne3x7
Owner

ne3x7 commented Apr 20, 2018

Hi, thanks for the contribution, I appreciate it a lot. I will look through it shortly and accept.

Do you want to join forces to further improve it?

@sekarpdkt
Contributor Author

I would definitely like to join. I am now working on some more improvements and will let you know once they are done. But if you are merging my changes, revert the prime number back to the original: I was not aware of the special role of those two numbers in the FNV hash algorithm :-)
