From 4d96840dfaedc9a661f8093957a9efca0e77605b Mon Sep 17 00:00:00 2001 From: Abhishek Singh Date: Sun, 17 Jun 2018 15:02:54 -0700 Subject: [PATCH] Documentation --- README.md | 11 ++-- README.md~ | 175 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 181 insertions(+), 5 deletions(-) create mode 100644 README.md~ diff --git a/README.md b/README.md index 9371a50..5f6ae7c 100644 --- a/README.md +++ b/README.md @@ -27,7 +27,7 @@ Both Trie and DAWG are Finite State Automaton(FSA) ```commandline pip install lexpy ``` -For versions older than 0.9.3, there is a problem in the package distribution +>For versions older than 0.9.3, there is a problem in the package distribution which has been resolved now in 0.9.3. I apologize, if that frustrated anyone. Lexpy version `0.9.3` is recommended and it supports both Python 2 and Python 3. @@ -45,7 +45,7 @@ Lexpy version `0.9.3` is recommended and it supports both Python 2 and Python 3. # Examples -Although, the examples below are shown only for trie, they can be used for a DAWG in the same way. +>Although, the examples below are shown only for trie, they can be used for a DAWG in the same way. Both Trie and DAWG support the same set of operations as shown in the above table. However, do read the section on "DAWG". @@ -111,8 +111,8 @@ print(trie.search_with_prefix('ab')) 3. Wildcard search using `?` and `*` -`?` = 0 or 1 occurance of any character -`*` = 0 or more occurance of any character + a.`?` = 0 or 1 occurance of any character + b.`*` = 0 or more occurance of any character ```python print(trie.search('a*o*')) @@ -120,12 +120,13 @@ print(trie.search('a*o*')) print(trie.search('su?t')) [u'suit'] +``` 4. Search for similar words using the notion of Levenstien Distance(LD) +```python print(trie.search_within_distance('arie', dist=2)) ['arbil', 'auric', 'athie'] - ``` # Directed Acyclic Word Graph (DAWG) diff --git a/README.md~ b/README.md~ new file mode 100644 index 0000000..9371a50 --- /dev/null +++ b/README.md~ @@ -0,0 +1,175 @@ +![Logo](https://github.com/aosingh/lexpy/blob/master/images/lexpylogo.png) + +[![PyPI version](https://badge.fury.io/py/lexpy.svg)](https://pypi.python.org/pypi/lexpy) +[![Travis](https://travis-ci.org/aosingh/lexpy.svg?branch=master)](https://travis-ci.org/aosingh/lexpy) +[![Build status](https://ci.appveyor.com/api/projects/status/hib5wm4qo2oop3ui?svg=true)](https://ci.appveyor.com/project/aosingh/lexpy) +[![Coverage Status](https://coveralls.io/repos/github/aosingh/lexpy/badge.svg?branch=master)](https://coveralls.io/github/aosingh/lexpy?branch=master) +[![Maintainability](https://api.codeclimate.com/v1/badges/60626f81c0db0c5d8dcd/maintainability)](https://codeclimate.com/github/aosingh/lexpy/maintainability) + +[![Python 2.7](https://img.shields.io/badge/python-2.7-blue.svg)](https://www.python.org/downloads/release/python-270/) +[![Python 3.3](https://img.shields.io/badge/python-3.3-blue.svg)](https://www.python.org/downloads/release/python-330/) +[![Python 3.4](https://img.shields.io/badge/python-3.4-blue.svg)](https://www.python.org/downloads/release/python-340/) +[![Python 3.5](https://img.shields.io/badge/python-3.5-blue.svg)](https://www.python.org/downloads/release/python-350/) +[![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/) + + +>A lexicon is a data-structure which stores a set of words. The difference between +a dictionary and a lexicon is that in a lexicon there are no values associated with the words. A lexicon is similar to a list of words or a set, but the internal representation is different and optimized +for faster searches(of words, prefixes and wildcard patterns). Precisely the search time is O(W) where W is the length of the word. + +2 important Lexicon data-structures are: +1. Trie. +3. Directed Acyclic Word Graph(DAWG). + +Both Trie and DAWG are Finite State Automaton(FSA) + +# Install +```commandline +pip install lexpy +``` +For versions older than 0.9.3, there is a problem in the package distribution +which has been resolved now in 0.9.3. I apologize, if that frustrated anyone. +Lexpy version `0.9.3` is recommended and it supports both Python 2 and Python 3. + +# Interface + +| **Interface Description** | **Trie** method | **DAWG** method | +|------------------------------------------------------------------------------------------------------------------------------- |------------------------------------------- |------------------------------------------- | +| Add a single word | `add('apple')` | `add('apple')` | +| Add multiple words | `add_all(['advantage', 'courage'])` | `add_all(['advantage', 'courage'])` | +| Check if exists? | `in` operator | `in` operator | +| Search using wildcard expression | `search('a?b*')` | `search('a?b*)` | +| Search for prefix matches | `search_with_prefix('bar')` | `search_with_prefix('bar')` | +| Search for similar words within given edit distance. Here, the notion of edit distance is same as Levenshtein distance (LD) | `search_within_distance('apble', dist=1)` | `search_within_distance('apble', dist=1)` | + + +# Examples + +Although, the examples below are shown only for trie, they can be used for a DAWG in the same way. +Both Trie and DAWG support the same set of operations as shown in the above table. + +However, do read the section on "DAWG". + +## Ways to build a Trie or a DAWG. + +1. From an input list, set, or tuple of words. + +```python +from lexpy.trie import Trie +trie = Trie() +input_words = [ + 'ampyx', 'abuzz', 'athie', 'amato', 'aneto', 'aruba', 'arrow', 'agony', 'altai', 'alisa', + 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin', 'almug', 'artha', 'algin', 'auric', + 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit', 'tank', 'common', 'lonely', 'likeable' + 'language', 'shock', 'look', 'pet', 'dime', 'small' 'dusty', 'accept', 'nasty', 'thrill', + 'foot', 'steel' +] + +trie.add_all(input_words) # You can pass any sequence types of a file like object here + +print trie.get_word_count() +40 +``` + +2. Use the `build_trie_from_file()` method + +```python +from lexpy.utils import build_trie_from_file +trie = build_trie_from_file('/path/to/file') + +``` + +3. From a file-like object. +```python + +from lexpy.trie import Trie + +# Either +trie.add_all('/path/to/file.txt') + +# Or +with open('/path/to/file.txt', 'r') as infile: + trie.add_all(infile) + +``` + +## Search + +1. Check if exists using the `in` operator + +```python +print 'ampyx' in trie +True +``` + +2. Prefix search + +```python +print(trie.search_with_prefix('ab')) +['abhor', 'abuzz'] +``` + +3. Wildcard search using `?` and `*` + +`?` = 0 or 1 occurance of any character +`*` = 0 or more occurance of any character + +```python +print(trie.search('a*o*')) +[u'acorn', u'abhor', u'agony', u'amato', u'aneto', u'arrow'] + +print(trie.search('su?t')) +[u'suit'] + +4. Search for similar words using the notion of Levenstien Distance(LD) + +print(trie.search_within_distance('arie', dist=2)) +['arbil', 'auric', 'athie'] + +``` + +# Directed Acyclic Word Graph (DAWG) + +>DAWG supports the same set of operations as a Trie. The difference is the number of nodes in a DAWG is always +less than or equal to the number of nodes in Trie. They both are Deterministic Finite State Automata. +However, DAWG is a minimized version of the Trie DFA. +In a Trie, prefix redundancy is removed. +In a DAWG, both prefix and suffix redundancies are removed. + +In the current implementation of DAWG, the insertion order of the words should be **alphabetical**. + + +```python +from lexpy.trie import Trie +from lexpy.dawg import DAWG + +trie = Trie() +trie.add_all(['advantageous', 'courageous']) + +dawg = DAWG() +dawg.add_all(['advantageous', 'courageous']) + +len(trie) # Number of Nodes in Trie +23 + +dawg.reduce() # Perform DFA minimization. Call this every time a chunk of words are uploaded in DAWG. + +len(dawg) # Number of nodes in DAWG +13 + +``` + + + +*Fun Facts* : +1. The 45-letter word pneumonoultramicroscopicsilicovolcanoconiosis is the longest English word that appears in a major dictionary. +So for all english words, the search time is bounded by O(45). +2. The longest technical word(not in dictionary) is the name of a protein called as [titin](https://en.wikipedia.org/wiki/Titin). It has 189,819 +letters and it is disputed whether it is a word. + + + + + + +