Text summarization, a core task in data science and natural language processing, condenses a document while retaining its meaning. This project focuses on abstractive summarization, using the BART model to generate concise summaries that may contain phrases not present in the original text. Applications span many domains, including science, literature, finance, legal analysis, meetings, video conferencing, and programming languages.
The dataset contains 40,000 professionally written summaries of news articles, along with links to the original articles. It comes from a GitHub repository and is stored as CSV, with features such as article title, summary, URL, date, and article content.
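To make the CSV layout concrete, here is a minimal sketch of reading such a file, dropping rows that lack a summary or article content, and holding out a test split. The column names (title, summary, url, date, content), the helper names, and the sample rows are assumptions made for illustration and are not taken from the actual repository. The sketch uses only the standard library to stay dependency-free; with the project's stack, pandas `read_csv` and scikit-learn `train_test_split` would be the natural equivalents.

```python
import csv
import io
import random


def load_articles(handle):
    """Read article rows, keeping only those with both content and a summary."""
    rows = []
    for row in csv.DictReader(handle):
        # Column names are hypothetical; adjust them to the real CSV header.
        content = (row.get("content") or "").strip()
        summary = (row.get("summary") or "").strip()
        if content and summary:
            rows.append({**row, "content": content, "summary": summary})
    return rows


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Tiny in-memory stand-in for the real file, for illustration only.
SAMPLE = io.StringIO(
    "title,summary,url,date,content\n"
    "A,Short a,http://example.com/a,2020-01-01,Long text a\n"
    "B,,http://example.com/b,2020-01-02,Long text b\n"
    "C,Short c,http://example.com/c,2020-01-03,Long text c\n"
)

articles = load_articles(SAMPLE)  # row B is dropped: it has no summary
train_rows, test_rows = split_rows(articles, test_fraction=0.5)
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing evaluation scores across training experiments.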
The aim is to perform abstractive text summarization on this dataset using the BART model.
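As a rough illustration of what this aim looks like in code, the sketch below generates a summary with the Hugging Face Transformers BART classes. It is not the project's actual implementation: the checkpoint name `facebook/bart-large-cnn`, the helper names `load_bart` and `summarize`, and the generation settings (beam count, length limits) are all assumptions chosen for the example.

```python
def load_bart(checkpoint="facebook/bart-large-cnn"):
    """Load a BART tokenizer and model; the checkpoint name is an assumption."""
    # Imported lazily so this module can be imported without the heavy dependency.
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained(checkpoint)
    model = BartForConditionalGeneration.from_pretrained(checkpoint)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def summarize(text, tokenizer, model, max_input_tokens=1024, max_summary_tokens=128):
    """Generate one abstractive summary for a single article string."""
    # Truncate long articles to the encoder's input limit.
    inputs = tokenizer(
        text,
        max_length=max_input_tokens,
        truncation=True,
        return_tensors="pt",
    )
    # Beam search; these settings are illustrative, not tuned values.
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_beams=4,
        max_length=max_summary_tokens,
        early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Calling `load_bart()` once and then `summarize(article_text, tokenizer, model)` for each article avoids reloading the weights. Passing the tokenizer and model in as arguments also makes it straightforward to swap in a fine-tuned checkpoint later.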
- Language: Python
- Libraries: pandas, scikit-learn, PyTorch, Transformers
- Environment: Google Colab
- Import the dataset using the dataset library and load a subset for initial data exploration.
- Clone the repository housing the data.
- Download article titles, summaries, URLs, and dates into a CSV file.
- Set up a new environment, install required dependencies, and scrape the data.
- Configure the runtime to use a GPU for faster processing.
- Import necessary packages and libraries.
- Define a class for the dataset.
- Define a class for the BART data loader.
- Implement a class for the abstractive summarization model.
- Instantiate the BART tokenizer.
- Define the data loader.
- Read and preprocess the data.
- Split the data into training and testing sets.
- Construct the main class for running the 'BartForConditionalGeneration' model and tokenizer.
- Define the trainer class and train the model.
- Run BART summarization using the pre-trained model.
- Understand ROUGE, the metric used to evaluate the generated summaries.
- For web application deployment:
  - Set up a new environment.
  - Install necessary packages from requirements.txt.
  - Navigate to the output folder.
  - Run app.py.
  - Access the web application locally on port 5000.
  - Input an article link for summarization, and the generated summary will be displayed.
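For the ROUGE step in the list above, the following simplified sketch shows what the metric measures: overlap between a generated summary and a reference summary. ROUGE-N counts shared n-grams, and ROUGE-L uses the longest common subsequence. This version assumes lowercased whitespace tokenization with no stemming, so its scores may differ from those of a full ROUGE package; the function names are chosen for the example.

```python
from collections import Counter


def _tokens(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) for clipped n-gram overlap."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(_tokens(reference)), ngrams(_tokens(candidate))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def rouge_l(reference, candidate):
    """Return (precision, recall, f1) based on the longest common subsequence."""
    ref, cand = _tokens(reference), _tokens(candidate)
    # Classic dynamic-programming table for LCS length.
    table = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            if r == c:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    lcs = table[-1][-1]
    precision = lcs / max(len(cand), 1)
    recall = lcs / max(len(ref), 1)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1
```

For example, with the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", five of the six unigrams match, so ROUGE-1 precision, recall, and F1 all come out to 5/6. Recall rewards covering the reference, precision penalizes extra material, and F1 balances the two.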