
Option to save and load trained model (some workaround suggested) #2

Open
sekarpdkt opened this issue Apr 16, 2018 · 5 comments
Labels
enhancement New feature or request

Comments

@sekarpdkt
Contributor

Hi

I tried to save the trained dictionary and reload it, but it is not working. Do you have any idea how to do it? Here is what I tried. To save the trained dictionary:

import json

# Collect the trained internals of the SymSpell instance `ss`.
myData = dict()
myData["_words"] = ss._words
myData["_deletes"] = ss._deletes
myData["_below_threshold_words"] = ss._below_threshold_words
myData["_max_length"] = ss._max_length
myData["_distance_algorithm"] = ss._distance_algorithm
myData["_max_dictionary_edit_distance"] = ss._max_dictionary_edit_distance
myData["_prefix_length"] = ss._prefix_length
myData["_count_threshold"] = ss._count_threshold
myData["_compact_mask"] = ss._compact_mask

filename = 'SymSpell_dictionary.json'
print('Saving dictionary...')
with open(filename, 'w', encoding='ISO-8859-1') as fp:
    json.dump(myData, fp)
print('Saved dictionary...')

Once saved, I tried to reload it like this:

myData = dict()
print('Loading dictionary...')
filename = 'SymSpell_dictionary.json'

with open(filename, 'r', encoding='ISO-8859-1') as fp:
    myData = json.load(fp)
print('Loaded dictionary...')

ss._words = myData["_words"]
ss._deletes = myData["_deletes"]
ss._below_threshold_words = myData["_below_threshold_words"]
ss._max_length = myData["_max_length"]
ss._distance_algorithm = myData["_distance_algorithm"]
ss._max_dictionary_edit_distance = myData["_max_dictionary_edit_distance"]
ss._prefix_length = myData["_prefix_length"]
ss._count_threshold = myData["_count_threshold"]
ss._compact_mask = myData["_compact_mask"]

It is not working: it loads without errors, but spell correction does not work afterwards.

As a workaround, I added the following two functions to the main file, which are working:

    def save_words_with_freq_as_json(self, filename, encoding="utf8"):
        # Dump only the word -> frequency dictionary; everything else
        # can be rebuilt from it. Requires `import json` in this file.
        print('Saving dictionary...')
        with open(filename, 'w', encoding=encoding) as fp:
            json.dump(self._words, fp)
        print('Saved dictionary...')

    def load_words_with_freq_from_json_and_build_dictionary(self, filename, encoding="utf8"):
        # Read the word -> frequency dictionary back and rebuild the
        # internal structures (including _deletes) entry by entry.
        print('Loading dictionary...')
        with open(filename, 'r', encoding=encoding) as fp:
            myData = json.load(fp)
        for word in myData:
            self._create_dictionary_entry(word, myData[word])
        print('Loaded dictionary...')

To use it, you can save like this:

filename = 'SymSpell_Dictionary_Word.json'
ss.save_words_with_freq_as_json(filename, encoding='ISO-8859-1')

and load like this:

ss = SymSpell(max_dictionary_edit_distance=3)
filename = 'SymSpell_Dictionary_Word.json'
ss.load_words_with_freq_from_json_and_build_dictionary(filename, encoding='ISO-8859-1')

The above works, if anyone is interested. But if we could save and load _deletes, _words, etc., it would be faster than training every time.
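
One more idea I have not tested with this code base: pickle preserves Python types, including int dictionary keys, so the trained state could in principle be dumped in one go. A rough sketch (the attribute names are the ones used above; everything else is an assumption):

import pickle

# Sketch (untested): pickle keeps int keys intact, so the trained
# internals should survive a round trip unchanged.
with open('SymSpell_state.pickle', 'wb') as fp:
    pickle.dump((ss._words, ss._deletes), fp)

with open('SymSpell_state.pickle', 'rb') as fp:
    ss._words, ss._deletes = pickle.load(fp)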

@ne3x7
Owner

ne3x7 commented Apr 16, 2018

Thanks for your suggestion.

There is a simple load_dictionary() method that reads whitespace-separated word-count pairs from a file. Creating this file is not implemented, but as you correctly noticed, it basically amounts to dumping the _words dictionary.
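
Producing such a file could look roughly like this (just a sketch with your ss instance from above; it assumes _words maps each word to its frequency count):

# Sketch: write ss._words as whitespace-separated word-count pairs,
# the format load_dictionary() reads back.
with open('frequency_dictionary.txt', 'w', encoding='utf8') as fp:
    for word, count in ss._words.items():
        fp.write('%s %d\n' % (word, count))

Calling load_dictionary() on the resulting file then rebuilds _deletes as part of loading.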

Saving and loading the _deletes dictionary was not intended in the original SymSpell, because it is well optimized to build quickly. I have not measured the speed yet, but building _deletes via load_dictionary() seems fast enough.

@ne3x7 ne3x7 added the enhancement New feature or request label Apr 16, 2018
@sekarpdkt
Contributor Author

sekarpdkt commented Apr 17, 2018

It was a simple issue. When we load JSON from a file, the keys are stored as strings, whereas the _deletes keys are ints (hash values). We need to do something like:

for hs in sorted(myDeleteData):
    ss._deletes[int(hs)] = myDeleteData[hs]

and the good news is that it works. The key here is that you need to convert hs to an int ([int(hs)]) while transferring it to ss._deletes. I will be sending a pull request. I also implemented multithreading for creating the hash table in Python: with four threads, loading 500K words took 3 minutes on my local machine, instead of 7 minutes without multithreading.
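
For completeness, the full round trip of _deletes then looks roughly like this (a sketch based on the fix above; it assumes the _deletes values are JSON-serializable lists, so only the keys need converting back):

import json

# Save: json.dump turns the int keys of _deletes into strings.
with open('SymSpell_deletes.json', 'w', encoding='ISO-8859-1') as fp:
    json.dump(ss._deletes, fp)

# Load: convert each key back to int while rebuilding _deletes,
# otherwise lookups by hash value miss every entry.
with open('SymSpell_deletes.json', 'r', encoding='ISO-8859-1') as fp:
    myDeleteData = json.load(fp)
ss._deletes = {int(hs): myDeleteData[hs] for hs in myDeleteData}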

@sekarpdkt
Contributor Author

Raised a pull request

@ne3x7
Owner

ne3x7 commented Apr 20, 2018

Hi, thanks for the contribution, I appreciate it a lot. I will look through it shortly and accept.

Do you want to join forces to further improve it?

@sekarpdkt
Contributor Author

I would definitely like to join. I am now working on some more improvements and will let you know once they are done. But if you are merging my changes, revert the prime number back to the original: I was not aware of the special role of those two numbers in the FNV hash algorithm :-)
