NLP-Information-Extraction

Automated PDF and text processing; information extraction from text based on grammatical structure

General NLP on Text (Applied on Company Transcripts)

pdfplumber extraction techniques; general data cleaning and boxplots of word counts and densities; centroid words with TF-IDF and extractive summarisation by ranking; topic modelling and clustering; grammatical trends via dependencies and parts of speech
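
As an illustration of the TF-IDF centroid idea, here is a minimal standard-library sketch, not the code used in this repository. It assumes the text is already split into sentences, builds one TF-IDF vector per sentence, averages them into a centroid, and ranks sentences by cosine similarity to that centroid. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `centroid_summary`) and the smoothed IDF formula are choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(sentences):
    """Return one {term: tf-idf weight} dict per sentence."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch) keeps weights positive.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid_summary(sentences, k=1):
    """Pick the k sentences closest to the TF-IDF centroid, in original order."""
    vectors = tfidf_vectors(sentences)
    centroid = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            centroid[term] += weight / len(vectors)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(vectors[i], centroid),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]


# Hypothetical transcript sentences, for illustration only.
sentences = [
    "Revenue grew strongly this quarter.",
    "Revenue growth was driven by cloud revenue.",
    "The weather was pleasant.",
]
summary = centroid_summary(sentences, k=1)
```

In practice a library vectoriser would likely replace `tfidf_vectors`; the ranking step stays the same.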

Keywords, Nouns and Topic Analysis (Applied to Patent Extracts)

Data preprocessing and word clouds over time periods; statistical analysis: keyword extraction with TF-IDF, compared against RAKE, Gensim and spaCy; topic modelling with Latent Dirichlet Allocation; Named Entity Recognition; nouns with Matcher and frequency/momentum analysis; noun pairing and network graphs
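
The noun pairing step can be pictured as counting which nouns appear together in the same extract and keeping the frequent pairs as weighted graph edges. The sketch below is an assumption about how that might look, not the repository's implementation. It takes noun lists as given (in a real pipeline they would come from a tagger), and the names `noun_pair_counts`, `to_edges` and `min_weight` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def noun_pair_counts(docs_nouns):
    """Count how often each unordered noun pair co-occurs in one document."""
    pairs = Counter()
    for nouns in docs_nouns:
        # Deduplicate and sort so (a, b) and (b, a) share a single key.
        for a, b in combinations(sorted(set(nouns)), 2):
            pairs[(a, b)] += 1
    return pairs


def to_edges(pair_counts, min_weight=2):
    """Keep pairs seen at least min_weight times as (node, node, weight)."""
    return sorted(
        ((a, b, w) for (a, b), w in pair_counts.items() if w >= min_weight),
        key=lambda edge: (-edge[2], edge[0], edge[1]),
    )


# Hypothetical noun lists from three patent extracts, for illustration only.
docs = [
    ["battery", "cell", "electrode"],
    ["battery", "electrode", "coating"],
    ["cell", "battery"],
]
edges = to_edges(noun_pair_counts(docs))
```

The resulting triples can be passed to a graph library, for example NetworkX's `Graph.add_weighted_edges_from`, to draw the network.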

Generalised Research (Applied to Web3 Continuous News Extracts)

Exploratory data analysis: frequency-based histograms and subplots; summarisation with TF-IDF centroid vectors; text statistics with PCA and K-means clustering; word2vec; graph centrality; formation of n-grams / phrases
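
For the n-gram / phrase formation step, a simple frequency-based approach is to slide a window over each tokenised article and keep the n-grams that recur. The following is a minimal sketch under that assumption; the repository may use a different method. The helper names `ngrams` and `frequent_phrases` and the `min_count` threshold are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_phrases(token_lists, n=2, min_count=2):
    """Count n-grams across documents and keep those seen min_count times or more."""
    counts = Counter(g for tokens in token_lists for g in ngrams(tokens, n))
    return [(" ".join(g), c) for g, c in counts.most_common() if c >= min_count]


# Hypothetical tokenised news snippets, for illustration only.
token_lists = [
    ["smart", "contract", "audit"],
    ["smart", "contract", "exploit"],
    ["token", "launch"],
]
phrases = frequent_phrases(token_lists)
```

Raising `n` gives longer phrases, and raising `min_count` filters out one-off combinations.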
