OpenNLP.Similarity Component

OpenNLP.Similarity is a project under Apache OpenNLP that applies machine learning to the results of parsing, part-of-speech tagging and rhetoric (discourse) parsing. It is leveraged in search, content generation and enrichment, chatbots and other text-processing domains where relevance assessment is a key task.

What is OpenNLP.Similarity?

OpenNLP.Similarity is an NLP engine which solves a number of text processing and search tasks based on the OpenNLP and Stanford NLP parsers. It is designed to be used by non-linguist software engineers to build linguistically enabled:

  • search engines
  • recommendation systems
  • dialogue systems
  • text analysis and semantic processing engines
  • data-loss prevention systems
  • content & document generation tools
  • recognizers of writing style, authenticity, sentiment and sensitivity to sharing
  • a general-purpose deterministic inductive learner equipped with abductive, deductive and analogical reasoning, which also embraces concept learning and tree kernel learning

OpenNLP.Similarity provides a series of techniques to support the overall content pipeline, from text collection to cleaning, classification, personalization and distribution. The technology and implementation of the content pipeline developed at eBay are described here.

Installation

  1. Do git clone to set up the environment, including resources. Besides what you get from git, the /resources directory requires some additional work (see the steps below).

  2. Download the main jar.

  3. Put all the necessary jars in the /lib folder. The larger jars are not on git, so please download them from the Stanford NLP site:

  • edu.mit.jverbnet-1.2.0.jar
  • ejml-0.23.jar
  • joda-time.jar
  • jollyday.jar
  • stanford-corenlp-3.5.2-models.jar
  • xom.jar

  The rest of the jars are available via Maven.

  4. Set up the src/test/resources directory:

  • new_vn.zip needs to be unzipped
  • OpenNLP models need to be downloaded into the 'models' directory from here

  As a result, the following folders should be in /resources.

  As obtained from git:

  • /new_vn (VerbNet)
  • /maps (lookup files such as products, brands, first names, etc.)
  • /external_rst (examples of importing rhetoric parses from other systems)
  • /fca (Formal Concept Analysis learning)
  • /taxonomies (for search support; taxonomies are auto-mined from the web)
  • /tree_kernel (for tree kernel learning: representations of parse trees, thickets and trained models)

  Manual downloading is also required for:

  • /new_vn
  • /w2v (where the word2vec model needs to be downloaded, if desired)
  5. Try running the tests, which will give you a hint on how to integrate OpenNLP.Similarity functionality into your application. You can start with the Matcher test and observe how long paragraphs can be linguistically matched (you can compare this with a plain intersection of keywords).

  6. Look at the example POMs for how to best integrate the component into your existing project.

Creating a simple project

Create a project from MyMatcher.java.
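A minimal sketch of what such a project's entry point might look like, assuming the Matcher class from the matching package described below (the no-argument constructor and the import paths are assumptions; verify them against MyMatcher.java in the repository):

```java
import java.util.List;

// Imports assume the component's package layout; check MyMatcher.java
import opennlp.tools.parse_thicket.matching.Matcher;
import opennlp.tools.textsimilarity.ParseTreeChunk;

public class MyMatcher {
    public static void main(String[] args) {
        // Matcher wraps the parsers and the common-subgraph computation
        Matcher matcher = new Matcher();

        String para1 = "Iran refuses to accept the UN proposal to end its dispute over nuclear weapons.";
        String para2 = "Iran refuses the UN offer to end a conflict over its nuclear program.";

        // Each inner list holds the common phrases of one phrase type (NP, VP, ...)
        List<List<ParseTreeChunk>> commonPhrases = matcher.assessRelevance(para1, para2);
        System.out.println(commonPhrases);
    }
}
```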

Engines and Systems of OpenNLP.Similarity

Main relevance assessment function

It takes two texts and returns the cardinality of the maximal common subgraph of the representations of these texts. This measure is intended to be much more accurate than keyword statistics or compositional semantic models such as word2vec, because linguistic structure is taken into account, not just co-occurrences of keywords. The Matcher class in the [matching package](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/main/java/opennlp/tools/parse_thicket/matching) has the

    List<List<ParseTreeChunk>> assessRelevance(String para1, String para2)

function, which returns the list of common phrases between these paragraphs.

To avoid reparsing the same strings and to improve speed, use

    List<List<ParseTreeChunk>> assessRelevanceCache(String para1, String para2)

It operates on the level of sentences (giving the maximal common subtree) and of paragraphs (giving the maximal common sub-parse thicket). The maximal common sub-parse thicket is also represented as a list of common phrases.

Search engine

The following set of functionalities is available to enable search with linguistic features. It is desirable when the query is long (more than 4 keywords), logically complex, or ambiguous:

  • Search results re-ranker based on linguistic similarity
  • Request handler for SOLR which uses parse tree similarity
  • Taxonomy builder via learning from the web
  • Verifier of the appropriate rhetoric map of an answer: if parts of the answer are located in distinct discourse units, the answer might be irrelevant even if all keywords are matched
  • Tree kernel learning re-ranker to improve search relevance within a given domain with a pre-trained model
  • SOLR request handlers are available here

The taxonomy builder is here. Examples of pre-built taxonomies are available in this directory; note the taxonomies built for languages other than English. A music taxonomy is an example of the seed data for taxonomy building, and this taxonomy hashmap dump is a good example of what can be constructed automatically. A paper on taxonomy learning is here.
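To make the hashmap representation concrete, here is a toy, self-contained illustration (not the component's API; all names and data are made up) of how a mined taxonomy can be stored as a map and used to check which taxonomy paths a query covers:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration, not the component's API: a taxonomy kept as a hashmap
// from a node to its mined children. The real builder fills such entries
// by mining the web from seed data like the music example above.
public class TaxonomyMatchSketch {
    public static void main(String[] args) {
        Map<String, List<String>> taxonomy = new HashMap<>();
        taxonomy.put("music", Arrays.asList("rock", "jazz", "classical"));
        taxonomy.put("jazz", Arrays.asList("bebop", "swing", "fusion"));

        List<String> query = Arrays.asList("jazz", "swing", "records");

        // A query term is covered if it is a taxonomy node or a child of one
        for (String term : query) {
            boolean covered = taxonomy.containsKey(term)
                    || taxonomy.values().stream().anyMatch(ch -> ch.contains(term));
            System.out.println(term + " covered by taxonomy: " + covered);
        }
    }
}
```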

Search results re-ranker

Re-ranking scores the similarity between each answer in a given orderedListOfAnswers and the question:

```java
// Fragment: m is a Matcher instance, parseTreeChunkListScorer a scorer,
// and question / orderedListOfAnswers come from the surrounding code
List<Pair<String, Double>> pairList = new ArrayList<Pair<String, Double>>();

for (String ans : orderedListOfAnswers) {
    // Cached matching avoids reparsing the question for every answer
    List<List<ParseTreeChunk>> similarityResult = m.assessRelevanceCache(question, ans);
    double score = parseTreeChunkListScorer.getParseTreeChunkListScoreAggregPhraseType(similarityResult);
    pairList.add(new Pair<String, Double>(ans, score));
}

// Sort by descending score so the most relevant answers come first
Collections.sort(pairList, Comparator.comparing((Pair<String, Double> p) -> p.getSecond()).reversed());
```

pairList is then ranked according to the linguistic relevance score. This score can be combined with other signals such as popularity, geo-proximity and others.
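Where such extra signals are available, one simple illustrative way to blend them (not prescribed by the component; the weights are hypothetical and would be tuned per application) is a weighted sum over normalized scores:

```java
// Illustrative only: blend the linguistic score with other normalized signals.
public class ScoreBlender {
    public static double combine(double linguisticScore, double popularity, double geoProximity) {
        // Hypothetical weights; all inputs assumed normalized to [0, 1]
        return 0.6 * linguisticScore + 0.3 * popularity + 0.1 * geoProximity;
    }
}
```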

Content generator

It takes a topic, builds a taxonomy for it and forms a table of contents. It then mines the web for documents for each table-of-contents item, finds relevant sentences and paragraphs and merges them into a document package. The resultant document has a TOC, sections, figures & captions and also a reference section. We attempt to reproduce how humans cut-and-paste content from the web while writing on a topic. Content generation has a demo; to run it from an IDE, start here. Examples of written documents are here. Another content generation option concerns opinion data: reviews are mined, cross-bred and made "original" for search engines. This and general content generation are done for SEO purposes. The review builder composes fake reviews, which in turn should be recognized by a fake review detector.
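A toy, self-contained skeleton of the pipeline just described (topic, then taxonomy-derived TOC, then mined fragments, then an assembled document); the stubbed methods stand in for the component's real taxonomy building, web mining and relevance matching:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy skeleton of the content-generation pipeline; the stubs stand in for
// the component's real taxonomy building, web mining and matching.
public class ContentGeneratorSketch {
    public static void main(String[] args) {
        String topic = "electric cars";
        StringBuilder document = new StringBuilder(topic.toUpperCase() + "\n\n");

        for (String section : buildTableOfContents(topic)) {
            document.append(section).append("\n");
            for (String fragment : mineRelevantFragments(section)) {
                document.append("  ").append(fragment).append("\n");
            }
        }
        System.out.println(document);
    }

    // Stub: the real pipeline derives the TOC from an auto-built taxonomy
    static List<String> buildTableOfContents(String topic) {
        return Arrays.asList("History", "Technology", "Market");
    }

    // Stub: the real pipeline mines the web and keeps fragments that pass
    // the relevance assessment against the section title
    static List<String> mineRelevantFragments(String section) {
        return Collections.singletonList("[mined paragraph relevant to " + section + "]");
    }
}
```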

Text classifier / feature detector in text

The classifier code is the same, but the model files vary for the applications below:

  • detect security leaks
  • detect argumentation
  • detect low cohesiveness in text
  • detect authors' doubt and low confidence
  • detect fake reviews

Document classification into six major classes {finance, business, legal, computing, engineering, health} is available via a nearest-neighbor model. A Lucene training model (a 1 GB file) is obtained from the Wikipedia corpus. This classifier can be trained for arbitrary classes once the respective Wiki pages are selected and the respective Lucene index is built. Once proper training documents are selected from Wikipedia with adequate coverage, the accuracy is usually higher than can be achieved by word2vec classification models.
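A sketch of nearest-neighbor document classification over a Lucene index, using the standard Lucene MoreLikeThis API (5.x-style signatures) rather than the component's exact code; it assumes each indexed Wikipedia page stores its text in a "text" field and its class label in a "category" field:

```java
import java.io.StringReader;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

// Sketch: classify a document by majority vote over its nearest neighbors
// in a Lucene index built from Wikipedia pages.
public class NearestNeighborClassifierSketch {
    public static String classify(String docText, String indexPath) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setAnalyzer(new StandardAnalyzer());
            mlt.setFieldNames(new String[] { "text" });
            Query query = mlt.like("text", new StringReader(docText));

            // Majority vote over the top-10 most similar indexed pages
            Map<String, Integer> votes = new HashMap<>();
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                Document d = searcher.doc(hit.doc);
                votes.merge(d.get("category"), 1, Integer::sum);
            }
            return votes.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey).orElse("unknown");
        }
    }
}
```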

A general-purpose deterministic inductive learner implements J.S. Mill's methods of induction and abduction (deduction is also partially implemented).

Inductive learning, implemented as a base for syntactic tree-based learning, is similar to the family of approaches that includes Explanation-Based Learning and Inductive Logic Programming.

Tree-kernel learning

Tree kernel learning is integrated to allow application of SVM learning to sentence-level and paragraph-level linguistic data, including discourse. Unlike learning in a numerical space, each dimension in tree kernel learning is an occurrence of a particular subtree, and similarity is not a numerical distance but a count of common subtrees. A set of parse trees for the individual sentences of a paragraph is called a parse thicket. Its graph representation is coded as a parenthesized tree, as in [model*.txt](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/master/src/test/resources/tree_kernel/model_pos_neg_sentiment.txt). To do model building and prediction, C modules are run in this directory, so the proper binary needs to be chosen from {svm_classify.linux, svm_classify.mac, svm_classify.exe, svm_learn.*}. Also, proper run permissions need to be set for these files.
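A sketch of invoking those bundled binaries from Java; the file-name suffixes follow the set listed above, while the argument layout (train_file model_file; test_file model_file predictions_file) and the "-t 5" tree-kernel switch follow SVM-light/SVM-light-TK's usual CLI convention, which you should verify against your binaries:

```java
import java.io.File;
import java.io.IOException;

// Sketch: run the bundled SVM binaries from Java. File names and argument
// order are assumptions to verify against the tree_kernel directory.
public class TreeKernelRunnerSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        File dir = new File("src/test/resources/tree_kernel");

        // Pick the binary for the current OS; run permissions must be set (chmod +x)
        String os = System.getProperty("os.name").toLowerCase();
        String suffix = os.contains("win") ? ".exe" : os.contains("mac") ? ".mac" : ".linux";

        // "-t 5" selects the tree kernel in SVM-light-TK (assumption to verify)
        new ProcessBuilder("./svm_learn" + suffix, "-t", "5", "training.txt", "model.txt")
                .directory(dir).inheritIO().start().waitFor();

        new ProcessBuilder("./svm_classify" + suffix, "test.txt", "model.txt", "predictions.txt")
                .directory(dir).inheritIO().start().waitFor();
    }
}
```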

Concept learning

Concept learning is a branch of deterministic learning which is applied to attribute-value pairs and, unlike statistical and deep learning, possesses a useful explainability feature. It is fairly useful for data exploration and visualization, since all interesting relations can be visualized. Concept learning covers inductive and abductive learning and also some cases of deduction. Explore this package for the concept learning-related features.

Filtering results for Speech Recognition based on semantic meaningfulness

It takes results from a speech-to-text system and subjects them to [filtering](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/master/src/main/java/opennlp/tools/similarity/apps/SpeechRecognitionResultsProcessor.java). Recognized candidate phrases whose words do not make sense together are filtered out, based on the frequency of co-occurrences found on the web.
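A sketch of feeding recognition hypotheses to that class; the no-argument constructor and the method name runSearchAndScoreMeaningfulness are assumptions to check against SpeechRecognitionResultsProcessor.java:

```java
import java.util.Arrays;
import java.util.List;

import opennlp.tools.similarity.apps.SpeechRecognitionResultsProcessor;

// Sketch: score alternative speech-to-text hypotheses by semantic
// meaningfulness. Constructor and method name are assumptions; consult
// SpeechRecognitionResultsProcessor.java for the real API.
public class SpeechFilterSketch {
    public static void main(String[] args) {
        List<String> hypotheses = Arrays.asList(
                "remember to buy milk tomorrow from trader joes",
                "remember to buy milk tomorrow from 3 to jones");

        SpeechRecognitionResultsProcessor proc = new SpeechRecognitionResultsProcessor();
        // Higher scores go to word sequences that co-occur frequently on the web
        System.out.println(proc.runSearchAndScoreMeaningfulness(hypotheses));
    }
}
```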

Related Research

Here is the link to the book on question answering, and to research papers.

There is also a recent book related to reasoning and linguistics in humans & machines.

Configuring the OpenNLP.Similarity component

The VerbNet model is included by default, so that the hand-coded meanings of verbs are used when similarity between verb phrases is computed.

To include the word2vec model, download it and make sure the following path is valid: resourceDir + "/w2v/GoogleNews-vectors-negative300.bin.gz"
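A small sanity check before enabling word2vec support (the resourceDir value below is a placeholder; point it at your actual resources directory):

```java
import java.io.File;

// Verify the word2vec model is where the component expects it.
public class W2vPathCheck {
    public static void main(String[] args) {
        String resourceDir = "src/test/resources"; // placeholder; adjust to your setup
        File w2v = new File(resourceDir + "/w2v/GoogleNews-vectors-negative300.bin.gz");
        System.out.println(w2v.exists()
                ? "word2vec model found: " + w2v.getAbsolutePath()
                : "word2vec model missing; download it into /w2v to enable word2vec features");
    }
}
```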