
Long Document Summarization in the context of Multi-Speaker Webinar Transcripts with LangChain, Transformers and PEFT


jolenechong/textSummarizerLLMsApp


Text Summarizer for Webinars

This repository contains a demo of the final application and fine-tuned model, together with a comprehensive exploration of long document summarization in the context of multi-speaker webinar transcripts. My in-depth research report navigates the complexities of long document summarization, evaluates currently available open-source and closed-source models, and walks through my process of fine-tuning our very own summarization model on limited resources. As this repository doesn't contain the code for the evaluation and fine-tuning of the models, I've collated some useful code snippets from my work here.

Date: October-November 2023
Live site: https://llm-text-summarizer.streamlit.app/ (backend is terminated at this time)
Fine-Tuned Open Source Models:

Documentation: Unleashing the Power of Large Language Models on Transcripts Summarization.pdf
Code Snippets: https://gist.github.com/jolenechong/0781431d894332ee44b7ef05caab7cbe

Here's a quick demo on the summarization features of the application and how it works.

LLM.Summarizer.Short.Demo.mp4

Architecture

Overall Architecture

Usage

Give this model a try! The snippet below loads jolenechong/lora-bart-samsum-tib-1024, the second of the fine-tuned models published above.
Here's how to use it:

# install these libraries if you haven't already
# !pip install transformers
# !pip install peft

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# load the base model, then apply the LoRA adapter on top of it
config = PeftConfig.from_pretrained("jolenechong/lora-bart-samsum-tib-1024")
model = AutoModelForSeq2SeqLM.from_pretrained("philschmid/bart-large-cnn-samsum")
model = PeftModel.from_pretrained(model, "jolenechong/lora-bart-samsum-tib-1024")
tokenizer = AutoTokenizer.from_pretrained("jolenechong/lora-bart-samsum-tib-1024")

text = """[add transcript you want to summarize here]"""
# truncate to the model's 1024-token context so longer transcripts don't error out
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"])
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

Feel free to check out the process through my documentation and code snippets as well as the first model above for more details on the fine-tuning process and the evaluation of the models.
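For transcripts longer than the model's 1024-token window, the report explores splitting the transcript into chunks, summarizing each chunk, and then summarizing the combined summaries (LangChain-style map-reduce). Here's a minimal sketch of the chunking step, using whitespace word counts as a stand-in for the real tokenizer; the function name and overlap size are illustrative, not from the repository's code:

```python
def chunk_text(text, max_tokens=1024, overlap=64):
    """Split text into chunks of at most max_tokens words, overlapping
    by `overlap` words so sentences cut at a boundary keep some context."""
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

# each chunk can then be passed through the summarizer, and the
# per-chunk summaries concatenated and summarized once more
chunks = chunk_text("word " * 2500, max_tokens=1024, overlap=64)
print(len(chunks))  # 3
```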

To run the front-end streamlit application locally, follow these steps:

# create virtual environment
py -m venv .venv

# activate it
.venv\Scripts\activate.bat   # for windows
source .venv/bin/activate    # for linux/macOS

# install relevant libraries
pip install -r requirements.txt

# initialize the db (run the lines below inside the Python shell started by `py`)
# you might need to set listen_addresses = 'localhost' in your postgresql.conf file if it's your first time running it
py
from app import app, db
app.app_context().push()
db.create_all()

# frontend
streamlit run streamlit-app.py

Contact

Jolene - [email protected]
