Integrating Stanford NLP

Author(s): Feng Hong, Yang Jiao

##Synopsis

The Stanford NLP package is a powerful Java library for natural language processing. The goal is to integrate some of its features as an operator that lets users extract named entities or parts of speech.

##Status

As of 6/13/2016: FINISHED

##Modules

edu.uci.ics.textdb.dataflow.nlpextractor

##Related Issues

https://github.com/TextDB/textdb/issues/33

##Stanford NLP package

Stanford NLP is a set of natural language analysis tools written in Java. Given raw human-language text, the tools output the base forms of words, their parts of speech, and whether they are names of companies, people, locations, etc. The package includes a POS tagger, a syntactic parser, and a named entity recognizer. Its analyses provide the foundational building blocks for higher-level and domain-specific text-understanding applications.

The purpose of this project is to integrate Stanford NLP as an extractor in TextDB. Users specify an NLP constant that selects what to extract: one of 8 Named Entity classes (Number, Location, Person, Organization, Money, Percent, Date, Time) or one of 4 Part of Speech types (Adjective, Adverb, Noun, Verb).
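As a rough illustration of how the twelve token types listed above could be modeled, here is a minimal Java sketch. The class `NlpTokenTypes`, the enum `TokenType`, and the method `fromPennTag` are hypothetical names chosen for this example and are not taken from the TextDB code. The sketch also assumes that part-of-speech tags follow the Penn Treebank tag set (the one used by the Stanford POS tagger), where adjective tags start with `JJ`, adverb tags with `RB`, noun tags with `NN`, and verb tags with `VB`.

```java
import java.util.Optional;

// Illustrative only: the class, enum, and method names are assumptions,
// not the actual TextDB API.
final class NlpTokenTypes {

    enum TokenType {
        // Named Entity classes
        NUMBER, LOCATION, PERSON, ORGANIZATION, MONEY, PERCENT, DATE, TIME,
        // Part of Speech types
        ADJECTIVE, ADVERB, NOUN, VERB;

        // True for the four Part of Speech types, false for Named Entity classes.
        boolean isPartOfSpeech() {
            return this == ADJECTIVE || this == ADVERB || this == NOUN || this == VERB;
        }
    }

    // Maps a Penn Treebank POS tag (e.g. "NNS", "VBD") to one of the four
    // coarse Part of Speech types by its prefix. Tags outside these four
    // groups (e.g. "DT" for determiners) yield an empty result.
    static Optional<TokenType> fromPennTag(String tag) {
        if (tag == null) {
            return Optional.empty();
        }
        if (tag.startsWith("JJ")) {
            return Optional.of(TokenType.ADJECTIVE);
        }
        if (tag.startsWith("RB")) {
            return Optional.of(TokenType.ADVERB);
        }
        if (tag.startsWith("NN")) {
            return Optional.of(TokenType.NOUN);
        }
        if (tag.startsWith("VB")) {
            return Optional.of(TokenType.VERB);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Print the coarse type for a few sample Penn Treebank tags.
        for (String tag : new String[] {"JJ", "RBR", "NNS", "VBD", "DT"}) {
            System.out.println(tag + " -> " + fromPennTag(tag));
        }
    }
}
```

Matching on the tag prefix keeps the mapping short, since every Penn Treebank variant of a category (for example `NN`, `NNS`, `NNP`, `NNPS` for nouns) shares the same first two letters.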

Common uses of the Stanford NLP package:

- Named Entity Recognition: identify entities such as names (PERSON, LOCATION, ORGANIZATION, MISC), numerical values (MONEY, NUMBER, ORDINAL, PERCENT), and temporal expressions (DATE, TIME, DURATION, SET).
- Lemmatization: reduce a word to its base form (lemma).
- Part-of-Speech tagging: determine whether a word is a noun, verb, adjective, etc.

##Presentation Slides

4/11/2016 Presentation: Project Overview

4/18/2016 Presentation: Stanford NLP introduction

4/25/2016 Presentation: [Status Report](https://docs.google.com/presentation/d/1ek18Zr0OqQ0RONj8D7W2aSGs9sz1etnf9bEnWTEA2ag/edit?usp=sharing)

##Performance Test

Machine setting: MacBook Pro (Late 2015), Intel Core i5, SSD hard drive, 8 GB memory.

Data set: 100k Medline records, about 150 MB

Performance results (average time reported in seconds):

| | All Named Entities | Part of Speech |
| --- | --- | --- |
| NlpExtractor | 2937 s | 209 s |

On average: about 34 documents/sec for Named Entity Recognition and about 480 documents/sec for Part of Speech recognition.

Data set: 1M Medline records, about 1.5 GB

| | All Named Entities | Part of Speech |
| --- | --- | --- |
| NlpExtractor | Too slow | 2110 s |

On average: about 474 documents/sec for Part of Speech recognition. Named Entity Recognition was too slow on this data set.
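The throughput figures above are the document count divided by the elapsed time. The following Java sketch reproduces that arithmetic from the measured times; the class and method names are hypothetical. It also projects how long Named Entity Recognition might take on the 1M-record data set, under the assumption (not verified by the tests above) that throughput stays the same as on the 100k-record run.

```java
// Illustrative only: names are assumptions made for this example.
final class ThroughputCheck {

    // Documents processed per second for a run over `documents` records.
    static double docsPerSecond(long documents, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return documents / seconds;
    }

    // Projected running time in hours at a given rate.
    // Assumes throughput stays constant as the data set grows.
    static double projectedHours(long documents, double docsPerSecond) {
        if (docsPerSecond <= 0) {
            throw new IllegalArgumentException("docsPerSecond must be positive");
        }
        return documents / docsPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        // Measured times taken from the performance tables.
        double ner100k = docsPerSecond(100_000, 2937);
        double pos100k = docsPerSecond(100_000, 209);
        double pos1m = docsPerSecond(1_000_000, 2110);

        System.out.printf("Named Entities, 100k records: %.1f docs/sec%n", ner100k);
        System.out.printf("Part of Speech, 100k records: %.1f docs/sec%n", pos100k);
        System.out.printf("Part of Speech, 1M records:   %.1f docs/sec%n", pos1m);

        // Extrapolation only: no Named Entity time was measured for 1M records.
        System.out.printf("Named Entities, 1M records (projected): %.1f hours%n",
                projectedHours(1_000_000, ner100k));
    }
}
```

Under the constant-throughput assumption, the Named Entity run on 1M records would take roughly 8 hours, which is consistent with it being reported as too slow.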

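##TODOs

According to the performance test, Named Entity extraction is slow: about 34 documents/sec on the 100k-record data set, compared with about 480 documents/sec for Part of Speech extraction. Future optimization is needed to make it faster.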

##Stanford NLP package License

GNU General Public License
