jutsu.nlp

A Clojure library for word embeddings, built on top of the deeplearning4j library.

Word2Vec and Doc2Vec features are fully functional. The project is currently in a stable state.

The API is subject to change in future versions.

Pull requests are welcome!

Usage

To install, add this to your dependencies:

[hswick/jutsu.nlp "0.1.0"]
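
Where that vector goes depends on your build tool. As a minimal sketch, assuming a Boot project (the Dev section below uses boot), the dependency could be declared in build.boot like this. The Leiningen form is shown as a comment, and my-app is a hypothetical project name, not something defined by this library.

```clojure
;; build.boot (Boot): add the coordinate to the project's dependencies
(set-env!
  :dependencies '[[hswick/jutsu.nlp "0.1.0"]])

;; project.clj (Leiningen) equivalent; my-app is a hypothetical name:
;; (defproject my-app "0.1.0-SNAPSHOT"
;;   :dependencies [[hswick/jutsu.nlp "0.1.0"]])
```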

To use jutsu.nlp:

(require '[jutsu.nlp.core :as nlp]
         '[jutsu.nlp.util :as util])

;;Configure your Word2Vec model
;;The optional map sets training parameters
(def w2v (nlp/word-2-vec "path/to/text-file"
  {:min-word-frequency 5
   :iterations 1
   :layer-size 100
   :seed 42
   :window-size 5}))

;;This trains the model on the data given
(nlp/fit! w2v)

;;Write the word2vec model to a file on disk
(nlp/write-word-vectors w2v "word_vectors.csv")

;;Load a word2vec model from a file on disk
(def w2v-2 (nlp/read-word-vectors (clojure.java.io/file "word_vectors.csv")))

;;If you want stop-word removal and stemming, initialize word2vec like this
(require '[jutsu.nlp.sentence-iterator :as iter]
         '[jutsu.nlp.tokenization :as token])

(nlp/word-2-vec
 (iter/default-iterator (util/absolute-path "neuromancer.txt"))
 (token/default-tokenizer-factory (token/common-stemmer-preprocessor))
 {:min-word-frequency 6
  :stopwords (nlp/stop-words)
  :window-size 10
  :layer-size 150})

;;If you want to train on a directory of text files, initialize like this
(def w2v4 (nlp/word-2-vec
            (iter/dir-iterator "path/to/dir")
            (token/default-tokenizer-factory)
            {:min-word-frequency 6
             :window-size 10
             :layer-size 150
             :stopwords (nlp/stop-words)}))
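
Once a model has been trained with nlp/fit!, you will usually want to query it. The sketch below is not part of the jutsu.nlp API. It assumes that nlp/word-2-vec and nlp/read-word-vectors return deeplearning4j WordVectors objects, and it calls the deeplearning4j methods hasWord, wordsNearest and similarity through Java interop. The helper names nearest-words and word-similarity are hypothetical.

```clojure
;; Assumption: `model` is a deeplearning4j WordVectors instance (for example
;; a fitted Word2Vec). The calls are reflective, so no import is needed.

(defn nearest-words
  "Returns a vector of up to n vocabulary words closest to word,
  or nil when word is not in the model's vocabulary."
  [model word n]
  (when (.hasWord model word)
    (vec (.wordsNearest model word (int n)))))

(defn word-similarity
  "Returns the similarity score deeplearning4j reports for words a and b."
  [model a b]
  (.similarity model a b))

(comment
  ;; With the fitted w2v model from above:
  (nearest-words w2v "day" 10)
  (word-similarity w2v "day" "night"))
```

The vocabulary check is there so that a word dropped by :min-word-frequency or :stopwords yields nil instead of being passed on to deeplearning4j.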

Dev

Run boot night to start Nightlight and begin editing your project in a browser.

Run boot test-code to run tests.

License

Copyright © 2017 FIXME

Distributed under the Eclipse Public License, either version 1.0 or (at your option) any later version.