
Integrating Stanford NLP

Feng edited this page Jul 7, 2016 · 4 revisions

Author(s): Feng Hong, Yang Jiao

##Synopsis

The Stanford NLP package is a powerful Java library for natural language processing. The goal is to integrate some of its features as an operator that lets users extract named entities or parts of speech.

Status

As of 6/13/2016: FINISHED

Modules

edu.uci.ics.textdb.dataflow.nlpextractor

Related Issues

https://github.com/TextDB/textdb/issues/33

##Stanford NLP package

Stanford NLP is a set of natural language analysis tools written in Java. Given raw text, the tools annotate each token with its base form, its part of speech, and whether it is the name of a company, person, location, etc. The package includes a POS tagger, a syntactic parser, and a named entity recognizer. Its analyses provide the foundational building blocks for higher-level and domain-specific text-understanding applications.

The purpose of this project is to implement Stanford NLP as an extractor in TextDB. Users specify an NLP constant, which is either one of 8 Named Entity classes (Number, Location, Person, Organization, Money, Percent, Date, Time) or one of 4 Part of Speech types (Adjective, Adverb, Noun, Verb).
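The twelve constants listed above fall into two groups, which could be modeled as a single enum that records the group of each constant. The sketch below is illustrative only: `NlpConstant`, `Kind`, and `ofKind` are hypothetical names and are not taken from the TextDB source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical enum for illustration; the names are assumptions,
// not the identifiers used in edu.uci.ics.textdb.dataflow.nlpextractor.
enum NlpConstant {
    // Named Entity classes
    NUMBER(Kind.NAMED_ENTITY),
    LOCATION(Kind.NAMED_ENTITY),
    PERSON(Kind.NAMED_ENTITY),
    ORGANIZATION(Kind.NAMED_ENTITY),
    MONEY(Kind.NAMED_ENTITY),
    PERCENT(Kind.NAMED_ENTITY),
    DATE(Kind.NAMED_ENTITY),
    TIME(Kind.NAMED_ENTITY),
    // Part of Speech types
    ADJECTIVE(Kind.PART_OF_SPEECH),
    ADVERB(Kind.PART_OF_SPEECH),
    NOUN(Kind.PART_OF_SPEECH),
    VERB(Kind.PART_OF_SPEECH);

    enum Kind { NAMED_ENTITY, PART_OF_SPEECH }

    private final Kind kind;

    NlpConstant(Kind kind) {
        this.kind = kind;
    }

    // Tells the extractor which kind of annotation this constant needs.
    Kind kind() {
        return kind;
    }

    // Returns every constant of the given kind, in declaration order.
    static List<NlpConstant> ofKind(Kind kind) {
        List<NlpConstant> result = new ArrayList<>();
        for (NlpConstant c : values()) {
            if (c.kind == kind) {
                result.add(c);
            }
        }
        return result;
    }
}
```

Keeping the group on the constant would let an extractor decide up front whether it needs the named entity recognizer or only the POS tagger, which matters given how different their running times are.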

  • Common uses of the Stanford NLP package:
  1. Named Entity Recognition: for example, names (PERSON, LOCATION, ORGANIZATION, MISC), numerical values (MONEY, NUMBER, ORDINAL, PERCENT), and temporal expressions (DATE, TIME, DURATION, SET).
  2. Lemmatization: reduce a word to its base form.
  3. Part-of-Speech tagging: determine whether a word is a noun, verb, adjective, etc.
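For English, the Stanford POS tagger emits fine-grained Penn Treebank tags (for example `NNS` or `VBD`), while this extractor exposes only four coarse types. One plausible way to bridge the two is a prefix match on the tag, sketched below. The class and method names are hypothetical, and the grouping rule is an assumption rather than the documented TextDB behavior.

```java
import java.util.Optional;

// Hypothetical helper: maps a Penn Treebank tag to one of the four
// coarse Part of Speech types named on this page.
final class PosCategories {

    private PosCategories() {
    }

    // Returns the coarse type, or empty when the tag belongs to none
    // of the four (for example "DT" for determiners).
    static Optional<String> coarseCategory(String pennTag) {
        if (pennTag == null) {
            return Optional.empty();
        }
        if (pennTag.startsWith("NN")) { // NN, NNS, NNP, NNPS
            return Optional.of("Noun");
        }
        if (pennTag.startsWith("VB")) { // VB, VBD, VBG, VBN, VBP, VBZ
            return Optional.of("Verb");
        }
        if (pennTag.startsWith("JJ")) { // JJ, JJR, JJS
            return Optional.of("Adjective");
        }
        if (pennTag.startsWith("RB")) { // RB, RBR, RBS
            return Optional.of("Adverb");
        }
        return Optional.empty();
    }
}
```

Under this rule, `coarseCategory("NNS")` yields `Noun` and `coarseCategory("DT")` yields an empty result. Tags such as `WRB` (wh-adverb) are deliberately left unmapped in this sketch.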

##Presentation Slides

4/11/2016 Presentation: Project Overview

4/18/2016 Presentation: Stanford NLP introduction

4/25/2016 Presentation: [Status Report](https://docs.google.com/presentation/d/1ek18Zr0OqQ0RONj8D7W2aSGs9sz1etnf9bEnWTEA2ag/edit?usp=sharing)

Performance Test

Machine setting: MacBook Pro (Late 2015), Intel Core i5, SSD, 8 GB memory.

  • Data set: 100k Medline records, about 150 MB
  • Performance results (average time reported in seconds):

| | All Named Entities | Part of Speech |
|---|---|---|
| NlpExtractor | 2937 s | 209 s |

  • On average: 34 documents/sec for Named Entity Recognition and 480 documents/sec for Part of Speech tagging.

  • Data set: 1M Medline records, about 1.5 GB

| | All Named Entities | Part of Speech |
|---|---|---|
| NlpExtractor | Too slow | 2110 s |

  • On average: about 500 documents/sec for Part of Speech tagging. Named Entity Recognition was too slow.
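The documents/sec figures follow from dividing the number of records by the elapsed time. A minimal sketch of that arithmetic, using the timings reported above (the class and method names are hypothetical):

```java
// Hypothetical helper that recomputes throughput from the reported timings.
final class Throughput {

    private Throughput() {
    }

    // Documents processed per second for a run over `documents` records.
    static double docsPerSecond(long documents, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return documents / seconds;
    }

    public static void main(String[] args) {
        // 100k records: Named Entities took 2937 s, Part of Speech took 209 s.
        System.out.println("NER, 100k: " + Math.round(docsPerSecond(100_000, 2937)));
        System.out.println("POS, 100k: " + Math.round(docsPerSecond(100_000, 209)));
        // 1M records: Part of Speech took 2110 s.
        System.out.println("POS, 1M:   " + Math.round(docsPerSecond(1_000_000, 2110)));
    }
}
```

Rounded to whole documents, this gives 34, 478, and 474 documents/sec, which matches the reported 34, 480, and "about 500" after rounding. Part of Speech throughput is roughly the same at both data set sizes.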

TODOs

  • According to the performance test, Named Entity extraction is slow: about 34 documents/sec on the 100k data set, compared with about 480 documents/sec for Part of Speech. Future optimization is needed to make it faster.

##Stanford NLP package License: GNU General Public License
