Skip to content

Latest commit

 

History

History
37 lines (26 loc) · 1.01 KB

README.md

File metadata and controls

37 lines (26 loc) · 1.01 KB

Wshiml

Implementation of http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html

Build requires oasis. Do:

./configure    # optionally with --prefix
make
make install

To build the example command-line program, do

./configure --enable-cli
make
make install
find-similar-docs --help

The command-line program requires cmdliner. The rest of the software has no dependencies apart from Oasis for building from git.

On Debian/Ubuntu, you can install all build dependencies with

sudo apt install oasis libcmdliner-ocaml-dev

So far the code is fairly unoptimised apart from what's described in http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html and uses 7s (4s with super-shingling) to cluster 1100 documents of altogether 766,937 words on an old 2.8 GHz AMD.

API documentation

is here.