This repository has been archived by the owner on May 30, 2020. It is now read-only.

Merge branch 'optimized'
Optimize by using generators, so the possible typos don't have to be stored in memory.
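
A minimal sketch of the idea, not the library's actual code: the same single-letter replacements are produced once as a set (the old shape) and once as a generator (the new shape). `ALPHABET`, `replaces_eager` and `replaces_lazy` are illustrative names assumed here; the real module defines its own alphabet and builds candidates from `self.slices` with `concat`.

```python
import string

# Assumption: a plain lowercase ASCII alphabet stands in for the module's ALPHABET.
ALPHABET = string.ascii_lowercase


def replaces_eager(word):
    """Build every single-letter replacement up front and keep them all in a set."""
    return {word[:i] + c + word[i + 1:]
            for i in range(len(word))
            for c in ALPHABET}


def replaces_lazy(word):
    """Yield the same candidates one at a time; none of them is stored."""
    for i in range(len(word)):
        for c in ALPHABET:
            yield word[:i] + c + word[i + 1:]


eager = replaces_eager("tge")   # all candidates exist in memory at once
lazy = replaces_lazy("tge")     # nothing has been computed yet
first = next(lazy)              # exactly one candidate has been computed
```

Both versions describe the same candidates; the generator may repeat a string that the set would have stored only once, so any deduplication has to happen in whatever consumes it.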

For long English words:

%timeit spell('disproporttionatelly')
before: 1 s ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
after: 762 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit spell('indistimguishabble')
before: 821 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
after: 619 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

More importantly, I'm working on support for Polish, where the alphabet is larger and words tend to be longer, so the change is significant:

%timeit spell('gżegrzółka')
before: 1.51 s ± 67.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
after: 370 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit spell('anarchokolektuwistycznychh')
before: 3.83 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
after: 2.2 s ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It also solves issue phatpiglet#18.
Before, running spell('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
would consume all the RAM it could. Now it takes only 8 KB more than idle and no longer freezes.
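
One way to observe this kind of difference is to compare peak allocations with the standard-library `tracemalloc` module. The following is only a sketch under stated assumptions: `inserts`, `double_inserts`, `eager_lookup`, `lazy_lookup` and the one-word `KNOWN` dictionary are hypothetical stand-ins, only insertions are generated, and the input is kept short so the script finishes quickly. It does not reproduce the 8 KB figure above.

```python
import tracemalloc

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: stand-in alphabet
KNOWN = {"x" * 10 + "ab"}                # hypothetical one-word dictionary


def inserts(word):
    """Lazily yield every single-letter insertion into word."""
    for i in range(len(word) + 1):
        for c in ALPHABET:
            yield word[:i] + c + word[i:]


def double_inserts(word):
    """Lazily yield every candidate two insertions away from word."""
    for e1 in inserts(word):
        yield from inserts(e1)


def peak_bytes(func, *args):
    """Run func(*args) and return (result, peak bytes traced while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def eager_lookup(word):
    # Old shape: materialize every candidate, then intersect with the dictionary.
    return set(double_inserts(word)) & KNOWN


def lazy_lookup(word):
    # New shape: test each candidate as it is produced and keep only the hits.
    return set(w for w in double_inserts(word) if w in KNOWN)


word = "x" * 10
eager_hits, eager_peak = peak_bytes(eager_lookup, word)
lazy_hits, lazy_peak = peak_bytes(lazy_lookup, word)
print(eager_hits == lazy_hits, eager_peak, lazy_peak)
```

The absolute numbers depend on the interpreter and platform, so only the relative gap between the two peaks is meaningful.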

With PyPy, the difference is even more dramatic:

%timeit spell('disproporttionatelly')
before: 668 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
after: 377 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit spell('indistimguishabble')
before: 585 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
after: 330 ms ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit spell('gżegrzółka')
before: Gets killed because it eats up too much RAM
after: 166 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit spell('anarchokolektuwistycznychh')
before: Gets killed because it eats up too much RAM
after: 994 ms ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
filyp committed Sep 15, 2019
2 parents 3f405d2 + 9fc62c5 commit 47d343e
Showing 2 changed files with 26 additions and 18 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
+*.pyc
43 changes: 25 additions & 18 deletions autocorrect/word.py
@@ -41,52 +41,59 @@ def __init__(self, word):

     def _deletes(self):
         """th"""
-        return {concat(a, b[1:])
-                for a, b in self.slices[:-1]}
+        for a, b in self.slices[:-1]:
+            yield concat(a, b[1:])

     def _transposes(self):
         """teh"""
-        return {concat(a, reversed(b[:2]), b[2:])
-                for a, b in self.slices[:-2]}
+        for a, b in self.slices[:-2]:
+            yield concat(a, reversed(b[:2]), b[2:])

     def _replaces(self):
         """tge"""
-        return {concat(a, c, b[1:])
-                for a, b in self.slices[:-1]
-                for c in ALPHABET}
+        for a, b in self.slices[:-1]:
+            for c in ALPHABET:
+                yield concat(a, c, b[1:])

     def _inserts(self):
         """thwe"""
-        return {concat(a, c, b)
-                for a, b in self.slices
-                for c in ALPHABET}
+        for a, b in self.slices:
+            for c in ALPHABET:
+                yield concat(a, c, b)

     def typos(self):
         """letter combinations one typo away from word"""
-        return (self._deletes() | self._transposes() |
-                self._replaces() | self._inserts())
+        yield from self._deletes()
+        yield from self._transposes()
+        yield from self._replaces()
+        yield from self._inserts()

     def double_typos(self):
         """letter combinations two typos away from word"""
-        return {e2 for e1 in self.typos()
-                for e2 in Word(e1).typos()}
+        for e1 in self.typos():
+            for e2 in Word(e1).typos():
+                yield e2


 def common(words):
     """{'the', 'teh'} => {'the'}"""
-    return set(words) & NLP_WORDS
+    return set(word for word in words
+               if word in NLP_WORDS)

 def exact(words):
     """{'Snog', 'snog', 'Snoddy'} => {'Snoddy'}"""
-    return set(words) & MIXED_CASE
+    return set(word for word in words
+               if word in MIXED_CASE)

 def known(words):
     """{'Gazpacho', 'gazzpacho'} => {'gazpacho'}"""
-    return {w.lower() for w in words} & KNOWN_WORDS
+    return set(word.lower() for word in words
+               if word.lower() in KNOWN_WORDS)

 def known_as_lower(words):
     """{'Natasha', 'Bob'} => {'bob'}"""
-    return {w.lower() for w in words} & LOWERCASE
+    return set(word.lower() for word in words
+               if word.lower() in LOWERCASE)

 def get_case(word, correction):
     """
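
The change to the filter functions in this diff can be illustrated in isolation. This is a sketch, not the module's code: `KNOWN_WORDS` here is a tiny hypothetical dictionary and `candidates` is a made-up generator standing in for `typos()`. It shows that filtering while iterating gives the same result as intersecting, while only ever storing the hits, and that repeated candidates from a generator collapse in the resulting set.

```python
# Assumption: a tiny stand-in for the module's KNOWN_WORDS set.
KNOWN_WORDS = {"gazpacho", "the"}


def known_eager(words):
    # Old shape: lower-case and store every candidate, then intersect.
    return {w.lower() for w in words} & KNOWN_WORDS


def known_lazy(words):
    # New shape: membership is checked per candidate; only hits are stored.
    return set(w.lower() for w in words if w.lower() in KNOWN_WORDS)


def candidates():
    # Hypothetical candidate stream; a generator may yield the same string twice.
    yield "Gazpacho"
    yield "gazzpacho"
    yield "Gazpacho"


eager_result = known_eager(candidates())
lazy_result = known_lazy(candidates())
```

The lazy version calls `lower()` twice per candidate, which trades a little CPU time for not holding the full candidate set.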
