So far in this series, we have looked at the open-source Scala projects Finatra and sbt. This week I decided to learn the Stanford CoreNLP library to perform sentiment analysis of unstructured text in Scala.
Sentiment analysis, or opinion mining, is a field that uses natural language processing to analyze the sentiments expressed in a given text. It has applications in many domains, ranging from marketing to customer service. A few years back, I wrote a simple Java application that used a Naive Bayes classifier to determine whether people liked a movie, based on sentiment analysis of tweets about it.
From the Stanford CoreNLP website,
Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.
The code for today’s demo application is available on GitHub: sentiment-analyzer.
Start by creating a new directory sentiment-analyzer at a convenient location on your filesystem. This directory will house the source code of our application.
$ mkdir sentiment-analyzer
Create a new file build.sbt inside the sentiment-analyzer directory. build.sbt is the sbt build script.
If you are new to sbt, then please refer to my earlier post on it.
Populate build.sbt with the following contents.
name := "sentiment-analyzer"
description := "A demo application to showcase sentiment analysis using Stanford CoreNLP and Scala"
version := "0.1.0"
scalaVersion := "2.11.7"
libraryDependencies += "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2" artifacts (Artifact("stanford-corenlp", "models"), Artifact("stanford-corenlp"))
libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6" % "test"
One thing that might be unfamiliar in the build script above is the use of artifacts. artifacts is used when a dependency defined in your build script has published multiple artifacts. We have used it to tell sbt that we need both the stanford-corenlp jar and its models artifact (referred to below as stanford-models) on our classpath. stanford-corenlp defines the core API that we will use in our code, and stanford-models contains the model data files that the stanford-corenlp library uses underneath. The stanford-models library is 378.1 MB in size, so sbt will take some time to download it.
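If the artifacts syntax feels opaque, the same two jars can usually be requested with sbt's classifier method instead. The following is a minimal sketch of that alternative, assuming the models jar is published under the models classifier (which is what Artifact("stanford-corenlp", "models") above selects). It is not what the demo project uses.

```scala
// build.sbt (alternative sketch): declare the main jar and the models jar separately.
libraryDependencies ++= Seq(
  // Core API classes
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2",
  // Same module and version, but the jar published with the "models" classifier
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2" classifier "models"
)
```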
Create a project layout for your Scala source and test files.
$ mkdir -p src/main/{scala,resources}
$ mkdir -p src/test/scala
The main part of the application is to analyze text for sentiments. We will write a sentiment analyzer in Scala that uses the stanford-corenlp API.
Let's start by writing a test case for positive sentiment. Create a new file SentimentAnalyzerSpec.scala inside the src/test/scala directory. We are using scalatest to write our test cases.
import org.scalatest.{FunSpec, Matchers}

class SentimentAnalyzerSpec extends FunSpec with Matchers {

  describe("sentiment analyzer") {

    it("should return POSITIVE when input has positive emotion") {
      val input = "Scala is a great general purpose language."
      val sentiment = SentimentAnalyzer.mainSentiment(input)
      sentiment should be(Sentiment.POSITIVE)
    }
  }
}
The test case shown above calls the mainSentiment method of the SentimentAnalyzer API. If the sentiment returned by SentimentAnalyzer is Sentiment.POSITIVE, the test passes. The mainSentiment method returns the sentiment of the longest sentence in the text. For example, for the input "Scala is a great general purpose language. I don't use it often." it returns Sentiment.POSITIVE, because the longer sentence, "Scala is a great general purpose language.", has positive emotion.
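The "longest sentence wins" rule can be tried out without CoreNLP. The sketch below is only an illustration: LongestSentenceDemo, mainLabel, and the hand-written labels in results are hypothetical stand-ins for real analyzer output, not values produced by the library.

```scala
object LongestSentenceDemo {

  // Hand-written (sentence, label) pairs standing in for real analyzer output.
  // The labels are placeholders chosen for the example.
  val results: List[(String, String)] = List(
    ("Scala is a great general purpose language.", "POSITIVE"),
    ("I don't use it often.", "NEGATIVE")
  )

  // Returns the label attached to the longest sentence.
  // Note: maxBy throws on an empty list, so callers must pass at least one pair.
  def mainLabel(pairs: List[(String, String)]): String = {
    val (_, label) = pairs.maxBy { case (sentence, _) => sentence.length }
    label
  }
}
```

Calling LongestSentenceDemo.mainLabel(LongestSentenceDemo.results) picks the label of the first pair, since its sentence is the longer of the two.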
Sentiment is an enum that we have defined in our application.
object Sentiment extends Enumeration {
  type Sentiment = Value
  val POSITIVE, NEGATIVE, NEUTRAL = Value

  def toSentiment(sentiment: Int): Sentiment = sentiment match {
    case x if x == 0 || x == 1 => Sentiment.NEGATIVE
    case 2 => Sentiment.NEUTRAL
    case x if x == 3 || x == 4 => Sentiment.POSITIVE
  }
}
Sentiment is a Scala enum with a toSentiment method defined. The toSentiment method is used by SentimentAnalyzer (discussed below) to convert the integer sentiment value returned by the stanford-corenlp API to an enum constant. The stanford-corenlp library gives a sentiment of 0 or 1 when the text has negative emotion, 2 when the text is neutral, and 3 or 4 when the text has positive emotion.
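Note that toSentiment as written throws a scala.MatchError for any integer outside the range 0 to 4. If that is a concern, one option is a total variant that returns an Option. The sketch below is an assumption about how that could look, not part of the demo project. SafeSentiment and toSentimentOption are hypothetical names.

```scala
object SafeSentiment extends Enumeration {
  type Sentiment = Value
  val POSITIVE, NEGATIVE, NEUTRAL = Value

  // Same mapping as described above (0 or 1 negative, 2 neutral, 3 or 4 positive),
  // but unexpected values yield None instead of a MatchError.
  def toSentimentOption(score: Int): Option[Sentiment] = score match {
    case 0 | 1 => Some(NEGATIVE)
    case 2     => Some(NEUTRAL)
    case 3 | 4 => Some(POSITIVE)
    case _     => None
  }
}
```

The caller then decides what an out-of-range value should mean, for example by falling back to a default with getOrElse.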
Let's now discuss SentimentAnalyzer. Its full source code is shown below.
import java.util.Properties

import com.shekhargulati.sentiment_analyzer.Sentiment.Sentiment
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations

import scala.collection.convert.wrapAll._

object SentimentAnalyzer {

  // Annotators that the pipeline runs on every input text.
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
  val pipeline: StanfordCoreNLP = new StanfordCoreNLP(props)

  // Returns the sentiment of the longest sentence in the input.
  def mainSentiment(input: String): Sentiment = Option(input) match {
    case Some(text) if !text.isEmpty => extractSentiment(text)
    case _ => throw new IllegalArgumentException("input can't be null or empty")
  }

  private def extractSentiment(text: String): Sentiment = {
    val (_, sentiment) = extractSentiments(text)
      .maxBy { case (sentence, _) => sentence.length }
    sentiment
  }

  // Returns one (sentence text, sentiment) pair per sentence found by the pipeline.
  def extractSentiments(text: String): List[(String, Sentiment)] = {
    val annotation: Annotation = pipeline.process(text)
    val sentences = annotation.get(classOf[CoreAnnotations.SentencesAnnotation])
    sentences
      .map(sentence => (sentence, sentence.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])))
      .map { case (sentence, tree) => (sentence.toString, Sentiment.toSentiment(RNNCoreAnnotations.getPredictedClass(tree))) }
      .toList
  }
}
The code shown above does the following:
- Creates an instance of StanfordCoreNLP. StanfordCoreNLP internally constructs a pipeline that takes a text and returns various analyzed linguistic forms. The properties define which annotators the pipeline will use. For this application, the tokenize, ssplit, parse, and sentiment annotators are used.
- The mainSentiment method checks whether the input string is valid (neither null nor empty) and, if it is, calls extractSentiment with the input text.
- The extractSentiment method calls maxBy on the list of two-element tuples returned by extractSentiments. Each tuple contains a sentence and its sentiment. maxBy compares the tuples by sentence length, i.e. the sentiment of the longest sentence is used as the main sentiment of the text.
- The extractSentiments method uses StanfordCoreNLP to process the text. The process method runs the pipeline on the input text. We then get all the sentence annotations from the annotation. Because we have imported scala.collection.convert.wrapAll._, we can call all the usual Scala collection methods on the sentences list. For each sentence annotation, we create a tuple of sentence text and sentiment value. Finally, we return the resulting list of (sentence text, sentiment) tuples.
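The wrapAll import is what allows map to be called on a java.util.List in extractSentiments. As a small illustration that does not need CoreNLP, the sketch below performs the same kind of conversion explicitly with scala.collection.JavaConverters and asScala. JavaListDemo and sentenceLengths are hypothetical names used only for this example, and the explicit conversion is an alternative to the implicit import used above, not what the demo code does.

```scala
import scala.collection.JavaConverters._

object JavaListDemo {

  // Converts a Java list to a Scala collection explicitly, then uses map on it.
  def sentenceLengths(sentences: java.util.List[String]): List[Int] =
    sentences.asScala.map(_.length).toList
}
```

The explicit asScala call makes it visible at the call site where a Java collection is being wrapped, at the cost of a little extra code.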
We can also write test cases for negative and neutral scenarios as shown below.
it("should return NEGATIVE when input has negative emotion") {
  val input = "Dhoni laments bowling, fielding errors in series loss"
  val sentiment = SentimentAnalyzer.mainSentiment(input)
  sentiment should be(Sentiment.NEGATIVE)
}

it("should return NEUTRAL when input has no emotion") {
  val input = "I am reading a book"
  val sentiment = SentimentAnalyzer.mainSentiment(input)
  sentiment should be(Sentiment.NEUTRAL)
}
We can also write another public method that just returns all the sentences and their sentiments as shown below.
def sentiment(input: String): List[(String, Sentiment)] = Option(input) match {
  case Some(text) if !text.isEmpty => extractSentiments(text)
  case _ => throw new IllegalArgumentException("input can't be null or empty")
}
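Both public methods repeat the same null-or-empty check. One way to share it, sketched here as a suggestion rather than as part of the demo project, is a small helper that validates the input and then applies whatever analysis function is passed in. InputValidation and withValidText are hypothetical names.

```scala
object InputValidation {

  // Applies `analyze` to the input only if it is neither null nor empty.
  // Otherwise throws the same exception as the methods shown above.
  def withValidText[A](input: String)(analyze: String => A): A =
    Option(input).filter(_.nonEmpty) match {
      case Some(text) => analyze(text)
      case None => throw new IllegalArgumentException("input can't be null or empty")
    }
}
```

With such a helper, mainSentiment could be written as withValidText(input)(extractSentiment) and sentiment as withValidText(input)(extractSentiments).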
That's all for this week. Please provide your valuable feedback by adding a comment to shekhargulati#5.