Merge branch 'optimized'

Optimize by using generators, so the possible typos don't have to be stored in memory.

For long English words:

%timeit spell('disproporttionatelly')
before: 1 s ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
after: 762 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit spell('indistimguishabble')
before: 821 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
after: 619 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

More importantly, I'm working on support for Polish, where the alphabet is larger and words tend to be longer, so the change is significant:

%timeit spell('gżegrzółka')
before: 1.51 s ± 67.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
after: 370 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit spell('anarchokolektuwistycznychh')
before: 3.83 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
after: 2.2 s ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It also solves issue phatpiglet#18. Before, running spell('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx') would consume all the RAM it could. Now it takes only 8 KB more than idle, and won't freeze.

With PyPy, the difference is even more dramatic:

%timeit spell('disproporttionatelly')
before: 668 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
after: 377 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit spell('indistimguishabble')
before: 585 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
after: 330 ms ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit spell('gżegrzółka')
before: gets killed because it eats up too much RAM
after: 166 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit spell('anarchokolektuwistycznychh')
before: gets killed because it eats up too much RAM
after: 994 ms ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
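
As a rough illustration of the idea in this message (not the package's actual code), the sketch below contrasts building every one-edit candidate in a set with yielding candidates lazily. All names here (`ALPHABET`, `KNOWN`, `slices`, `typos_eager`, `typos_lazy`, `double_typos_lazy`, `known`) are hypothetical stand-ins chosen for the example; the real package has its own alphabet and word lists.

```python
import string
from itertools import islice

# Hypothetical stand-ins; the real package defines its own alphabet and word lists.
ALPHABET = string.ascii_lowercase
KNOWN = {"the", "then", "hen"}


def slices(word):
    """All (prefix, suffix) splits of word, including the empty suffix."""
    return [(word[:i], word[i:]) for i in range(len(word) + 1)]


def typos_eager(word):
    """Build every one-edit candidate up front and hold them all in a set."""
    parts = slices(word)
    deletes = {a + b[1:] for a, b in parts[:-1]}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in parts[:-2]}
    replaces = {a + c + b[1:] for a, b in parts[:-1] for c in ALPHABET}
    inserts = {a + c + b for a, b in parts for c in ALPHABET}
    return deletes | transposes | replaces | inserts


def typos_lazy(word):
    """Yield the same candidates one at a time (duplicates are possible)."""
    parts = slices(word)
    for a, b in parts[:-1]:
        yield a + b[1:]
    for a, b in parts[:-2]:
        yield a + b[1] + b[0] + b[2:]
    for a, b in parts[:-1]:
        for c in ALPHABET:
            yield a + c + b[1:]
    for a, b in parts:
        for c in ALPHABET:
            yield a + c + b


def double_typos_lazy(word):
    """Candidates two edits away, never materialised as a whole."""
    for e1 in typos_lazy(word):
        yield from typos_lazy(e1)


def known(candidates):
    """Keep only dictionary words; only the small result set is stored."""
    return {w for w in candidates if w in KNOWN}


# A very long input is no problem for the lazy version: we can take a few
# candidates without ever generating the rest.
first_three = list(islice(double_typos_lazy("x" * 200), 3))
```

One consequence of the lazy version is that duplicates are no longer removed along the way, so deduplication only happens when the filtered result is collected into a set, as `known` does here.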
filyp · Sep 15, 2019 · 47d343e
2 parents 3f405d2 + 9fc62c5
commit 47d343e
Showing 2 changed files with 26 additions and 18 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+*.pyc
diff --git a/autocorrect/word.py b/autocorrect/word.py
@@ -41,52 +41,59 @@ def __init__(self, word):
 
     def _deletes(self):
         """th"""
-        return {concat(a, b[1:])
-                for a, b in self.slices[:-1]}
+        for a, b in self.slices[:-1]:
+            yield concat(a, b[1:])
 
     def _transposes(self):
         """teh"""
-        return {concat(a, reversed(b[:2]), b[2:])
-                for a, b in self.slices[:-2]}
+        for a, b in self.slices[:-2]:
+            yield concat(a, reversed(b[:2]), b[2:])
 
     def _replaces(self):
         """tge"""
-        return {concat(a, c, b[1:])
-                for a, b in self.slices[:-1]
-                for c in ALPHABET}
+        for a, b in self.slices[:-1]:
+            for c in ALPHABET:
+                yield concat(a, c, b[1:])
 
     def _inserts(self):
         """thwe"""
-        return {concat(a, c, b)
-                for a, b in self.slices
-                for c in ALPHABET}
+        for a, b in self.slices:
+            for c in ALPHABET:
+                yield concat(a, c, b)
 
     def typos(self):
         """letter combinations one typo away from word"""
-        return (self._deletes() | self._transposes() |
-                self._replaces() | self._inserts())
+        yield from self._deletes()
+        yield from self._transposes()
+        yield from self._replaces()
+        yield from self._inserts()
 
     def double_typos(self):
         """letter combinations two typos away from word"""
-        return {e2 for e1 in self.typos()
-                for e2 in Word(e1).typos()}
+        for e1 in self.typos():
+            for e2 in Word(e1).typos():
+                yield e2
 
 
 def common(words):
     """{'the', 'teh'} => {'the'}"""
-    return set(words) & NLP_WORDS
+    return set(word for word in words
+                if word in NLP_WORDS)
 
 def exact(words):
     """{'Snog', 'snog', 'Snoddy'} => {'Snoddy'}"""
-    return set(words) & MIXED_CASE
+    return set(word for word in words
+                if word in MIXED_CASE)
 
 def known(words):
     """{'Gazpacho', 'gazzpacho'} => {'gazpacho'}"""
-    return {w.lower() for w in words} & KNOWN_WORDS
+    return set(word.lower() for word in words
+                if word.lower() in KNOWN_WORDS)
 
 def known_as_lower(words):
     """{'Natasha', 'Bob'} => {'bob'}"""
-    return {w.lower() for w in words} & LOWERCASE
+    return set(word.lower() for word in words
+                if word.lower() in LOWERCASE)
 
 def get_case(word, correction):
     """