Text Summarization

Using extractive and abstractive approaches to create summaries

This is the repository for my capstone project at Springboard Machine Learning bootcamp

1. Project description

The objective of this project is to develop a text summarization tool able to create a short version of a given document retaining it most important information. This task is relevant for to access textual information and produce digests of news, social media and reviews. It can also be applied as part of other AI tasks such as answering questions and providing recommendations.

Dataset: The CNN news highlights dataset, which contains news articles and associated highlights, i.e., a few bullet points giving a brief overview of the article, with 92,579 documents.

The CNN dataset was downloaded from New York University, in the version made available by Kyunghyun Cho, which can be found here

A description of this project development can be found on my portfolio website,

2. Data cleaning

Basic processing of the original dataset file separting article from summaries.

Notebook: 01-process-raw-data.ipynb [launch notebook on Codelab]

3. Exploratory Data Analysis (EDA)

Analysis of number of characteres, words and sentences on both articles and summaries. Identification of malformed articles and cleaning the dataset from them.

Notebook: 02-exploratory-data-analysis.ipynb [launch notebook on Codelab]

4. Algorithms

For the extractive approach, we used a sentence scoring algorithm, which was mostly based on Alfrick Opidi's article on Floydhub, named "A Gentle Introduction to Text Summarization in Machine Learning".

Notebook: 03-sentence-scoring-algorithm.ipynb [launch notebook on Codelab]

For the abstrative approach, we used a machine learning RNN seq-2-seq model, which was originally inspired on the translation algorithm proposed by Trung Tran in the blog post Neural Machine Translation With Attention Mechanism

Notebook: train_model.ipynb [launch notebook on Codelab]

5. Serving the model

HTTP POST calls to the extractive model API

Format:

curl -X POST --data-binary @<filename> -d 'tokenizer=\<stem | lemma\>&n_gram=<1-gram |2-gram | 3-gram>&threshold_factor=<float>' https&#58;//summarizer-lopasso&#46;herokuapp&#46;&#8203;com/predict

HTTP POST calls to the abstractive model API

Format:

curl --data-binary @<filename> https://us-central1-data-engineering-gcp.cloudfunctions.net/summarizer

For both approaches (extractive and abstractive) the response is a JSON in the following format:

{"prediction" : "The generated summary"}

Web interface

Access the app hosted by a GCP instance using streamlit. link. The app has a self explanatory page, where the inputs are the text to be summarized and the algorithm parameters. The generated summary appears in the field on the bottom of the page, when the button "Submit" is pressed.

Name		Name	Last commit message	Last commit date
Latest commit History 69 Commits
abstractive		abstractive
app-summarizer		app-summarizer
data		data
extractive/notebooks		extractive/notebooks
sentence_score_api_files		sentence_score_api_files
sentence_scoring_api		sentence_scoring_api
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Text Summarization

Using extractive and abstractive approaches to create summaries

Contents

1. Project description

2. Data cleaning

3. Exploratory Data Analysis (EDA)

4. Algorithms

5. Serving the model

HTTP POST calls to the extractive model API

HTTP POST calls to the abstractive model API

Web interface

About

Releases

Packages

License

glopasso/text-summarization

Folders and files

Latest commit

History

Repository files navigation

Text Summarization

Using extractive and abstractive approaches to create summaries

Contents

1. Project description

2. Data cleaning

3. Exploratory Data Analysis (EDA)

4. Algorithms

5. Serving the model

HTTP POST calls to the extractive model API

HTTP POST calls to the abstractive model API

Web interface

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages