Integrating Stanford NLP
Author(s): Feng Hong, Yang Jiao
##Synopsis
The Stanford NLP package is a powerful Java library for natural language processing. The goal is to integrate some of its features as an operator that lets users extract Named Entities or Parts of Speech.
As of 6/13/2016: FINISHED
edu.uci.ics.textdb.dataflow.nlpextractor
https://github.com/TextDB/textdb/issues/33
##Stanford NLP package
Stanford NLP is a set of natural language analysis tools written in Java. Given raw human-language text, the tools output the base forms of words, their parts of speech, and whether they are names of companies, people, locations, etc. The package includes a POS tagger, a syntactic parser, and a named entity recognizer. Its analyses provide the foundational building blocks for higher-level and domain-specific text-understanding applications.
The purpose of this project is to implement Stanford NLP as an extractor in TextDB. Users specify an NLP constant, which is either a Named Entity class (Number, Location, Person, Organization, Money, Percent, Date, Time) or a Part of Speech type (Adjective, Adverb, Noun, Verb).
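One way to picture the NLP constants is as a single enumeration in which each value carries its category. The following is a minimal sketch, not TextDB's actual API: the names `NlpConstantSketch`, `NlpConstant`, `Category`, and `withCategory` are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the NLP constants described above; not TextDB's real classes. */
final class NlpConstantSketch {

    /** The two families of constants a user can ask the extractor for. */
    enum Category { NAMED_ENTITY, PART_OF_SPEECH }

    /** Each constant records which family it belongs to. */
    enum NlpConstant {
        NUMBER(Category.NAMED_ENTITY),
        LOCATION(Category.NAMED_ENTITY),
        PERSON(Category.NAMED_ENTITY),
        ORGANIZATION(Category.NAMED_ENTITY),
        MONEY(Category.NAMED_ENTITY),
        PERCENT(Category.NAMED_ENTITY),
        DATE(Category.NAMED_ENTITY),
        TIME(Category.NAMED_ENTITY),
        ADJECTIVE(Category.PART_OF_SPEECH),
        ADVERB(Category.PART_OF_SPEECH),
        NOUN(Category.PART_OF_SPEECH),
        VERB(Category.PART_OF_SPEECH);

        private final Category category;

        NlpConstant(Category category) {
            this.category = category;
        }

        Category category() {
            return category;
        }
    }

    /** Returns all constants that belong to the requested category, in declaration order. */
    static List<NlpConstant> withCategory(Category wanted) {
        List<NlpConstant> result = new ArrayList<>();
        for (NlpConstant constant : NlpConstant.values()) {
            if (constant.category() == wanted) {
                result.add(constant);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("Named entities:  " + withCategory(Category.NAMED_ENTITY));
        System.out.println("Parts of speech: " + withCategory(Category.PART_OF_SPEECH));
    }
}
```

Keeping the category on the constant lets an extractor decide from the constant alone whether it needs the named entity recognizer or only the POS tagger, which matters given how different their costs are.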
- Common usage of Stanford NLP package:
  - Named Entity Recognition: for example, names (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities.
  - Lemmatization
  - Part-of-Speech tagging: determine if a word is a noun, verb, adjective, etc.
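The Stanford POS tagger's English models emit fine-grained Penn Treebank tags (for example `NNS` or `VBD`), so an extractor that exposes only coarse classes such as noun or verb has to group them. The sketch below shows one plausible grouping by tag prefix. It is an assumption about how such a mapping could look, not the mapping TextDB uses, and `PosClassSketch`, `PosClass`, and `fromPennTag` are hypothetical names.

```java
import java.util.Optional;

/** Hypothetical grouping of Penn Treebank tags into four coarse part-of-speech classes. */
final class PosClassSketch {

    enum PosClass { ADJECTIVE, ADVERB, NOUN, VERB }

    /**
     * Maps a Penn Treebank tag to a coarse class by prefix:
     * JJ* -> adjective, RB* -> adverb, NN* -> noun, VB* -> verb.
     * Any other tag (determiners, modals, punctuation, ...) yields an empty result.
     */
    static Optional<PosClass> fromPennTag(String tag) {
        if (tag == null) {
            return Optional.empty();
        }
        if (tag.startsWith("JJ")) {
            return Optional.of(PosClass.ADJECTIVE);
        }
        if (tag.startsWith("RB")) {
            return Optional.of(PosClass.ADVERB);
        }
        if (tag.startsWith("NN")) {
            return Optional.of(PosClass.NOUN);
        }
        if (tag.startsWith("VB")) {
            return Optional.of(PosClass.VERB);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A few sample tags: plural noun, past-tense verb, comparative adjective, adverb, determiner.
        for (String tag : new String[] {"NNS", "VBD", "JJR", "RB", "DT"}) {
            System.out.println(tag + " -> " + fromPennTag(tag));
        }
    }
}
```

Prefix matching keeps the mapping short, at the cost of deliberately excluding tags such as `MD` (modal) and `WRB` (wh-adverb); whether those should count as verbs or adverbs is a design choice the source does not settle.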
##Presentation Slides
4/11/2016 Presentation: Project Overview
4/18/2016 Presentation: Stanford NLP introduction
4/25/2016 Presentation: [Status Report](https://docs.google.com/presentation/d/1ek18Zr0OqQ0RONj8D7W2aSGs9sz1etnf9bEnWTEA2ag/edit?usp=sharing)
- Machine setting: MacBook Pro (Late 2015), Intel Core i5, SSD hard drive, 8 GB memory.
- Data set: 100k Medline records, about 150 MB
- Performance results (average time reported in seconds):
| | All Named Entities | Part of Speech |
|---|---|---|
| NlpExtractor | 2937s | 209s |
- On average: 34 docs/sec for Named Entity Recognition and 480 docs/sec for Part of Speech recognition.
- Data set: 1M Medline records, about 1.5 GB
| | All Named Entities | Part of Speech |
|---|---|---|
| NlpExtractor | Too Slow | 2110s |
- On average, about 474 docs/sec for Part of Speech recognition (1M records in 2110s). Named Entity Recognition is slow.
- According to the performance test, Named Entity extraction is the bottleneck: on the 100k data set it takes roughly 14 times as long as Part of Speech extraction (2937s vs. 209s). Future optimization is needed to make it faster.
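The throughput figures follow directly from the record counts and times in the tables. The sketch below recomputes them and adds a rough projection of how long Named Entity Recognition would take on the 1M data set. The projection assumes throughput stays constant as the data set grows, which is an assumption and not something that was measured; `ThroughputSketch` and its method names are hypothetical.

```java
import java.util.Locale;

/** Recomputes throughput from the reported times; the 1M projection is an assumption. */
final class ThroughputSketch {

    /** Documents processed per second for a run over the given number of documents. */
    static double docsPerSecond(long documents, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return documents / seconds;
    }

    /** Seconds needed for the given number of documents at a constant throughput. */
    static double projectedSeconds(long documents, double docsPerSecond) {
        if (docsPerSecond <= 0) {
            throw new IllegalArgumentException("docsPerSecond must be positive");
        }
        return documents / docsPerSecond;
    }

    public static void main(String[] args) {
        double ner100k = docsPerSecond(100_000, 2937);   // Named Entities, 100k records
        double pos100k = docsPerSecond(100_000, 209);    // Part of Speech, 100k records
        double pos1m = docsPerSecond(1_000_000, 2110);   // Part of Speech, 1M records

        System.out.printf(Locale.ROOT, "NER 100k: %.1f docs/sec%n", ner100k);
        System.out.printf(Locale.ROOT, "POS 100k: %.1f docs/sec%n", pos100k);
        System.out.printf(Locale.ROOT, "POS 1M:   %.1f docs/sec%n", pos1m);

        // Assumption: NER throughput on 1M records equals the rate measured on 100k records.
        double nerHours = projectedSeconds(1_000_000, ner100k) / 3600.0;
        System.out.printf(Locale.ROOT, "Projected NER on 1M: %.1f hours%n", nerHours);
    }
}
```

With the figures from the tables this gives roughly 34, 478, and 474 docs/sec, and a projected Named Entity run of about 8 hours on the 1M data set under the constant-throughput assumption.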
##Stanford NLP package License: GNU General Public License