This guide presents an example of using the Lamini LLM pipeline to implement a performant and reliable LLM processing workflow that generates meaningful questions and answers from the earnings call transcripts of publicly traded companies. You can think of the pipeline as "chat with earnings calls in batches".
The source code is in generate_data.py, and we'll walk through it in the rest of this guide. Reading the code and its comments is the best way to understand how GenerationPipeline works.
We use Llama 3 in this guide. Llama 3 can read English text and reason over it. The pipeline inserts processing steps before and after each call to the Llama 3 inference RPCs.
Run the following script to have Llama 3 read through earnings calls, act as a financial analyst, ask relevant questions, and answer them using the source text.
cd 05_data_pipeline
python3 generate_data.py
For this example we only generate Q&A for the first line of the transcript file, since the full transcript is massive. Below is a sample of the pipeline's output.
{
  "company": "WPP",
  "question": "What is the percentage growth rate of WPP's business in Germany in Q1, according to Mark Read?",
  "answer": "16%"
}
{
  "company": "GDOT",
  "question": "What is the size of the asset size that GDOT aims to maintain to protect its revenue",
  "answer": "According to the transcript, GDOT aims to maintain an asset size of $10 billion or less to protect its revenue"
}
The Lamini LLM pipeline will automatically distribute your LLM calls over the entire cluster so you don't have to think about thread pools and batching.
LLMs are extremely computationally intensive. Processing even a modest amount of data (e.g., a few GB) may require hundreds of GPUs to finish quickly, so we recommend using this interface for any data processing job with more than ~100 LLM calls.
The pipeline also retries automatically, so transient failures in the Llama 3 inference RPCs do not break the whole run.
A Lamini LLM pipeline is a series of stages. Each stage is implemented as a subclass of the GenerationNode class, and each stage accepts an AsyncGenerator and produces another AsyncGenerator.
In this guide, the pipeline is defined in QuestionAnswerPipeline. It has two stages, QuestionGenerator and AnswerGenerator, as shown in the forward() function below.
lamini-examples/05_data_pipeline/generate_data.py, lines 19–33 (commit 70accea)
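If you don't have the file open, the pipeline definition looks roughly like the sketch below. The class and method names come from this guide; the import path and constructor details are assumptions, so check generate_data.py for the exact code.

```python
from lamini.generation.generation_pipeline import GenerationPipeline  # import path may differ

class QuestionAnswerPipeline(GenerationPipeline):
    def __init__(self):
        super().__init__()
        # The two stages described above.
        self.question_generator = QuestionGenerator()
        self.answer_generator = AnswerGenerator()

    def forward(self, x):
        # x is an AsyncGenerator of PromptObjects; each stage consumes one
        # AsyncGenerator and returns another, so the stages chain naturally.
        x = self.question_generator(x)
        x = self.answer_generator(x)
        return x
```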
We need to provide input to the pipeline and save its results. This is shown in run_pipeline() below, where the input is provided by load_earnings_call() and the results are saved by save_answers():
lamini-examples/05_data_pipeline/generate_data.py, lines 148–151 (commit 70accea)
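Sketched out, the wiring is only a few lines. The pipeline's entry-point method name and the asyncio wiring shown here are assumptions; see generate_data.py for the exact calls.

```python
import asyncio

async def run_pipeline():
    # load_earnings_call() is an AsyncGenerator of PromptObjects; the pipeline
    # returns another AsyncGenerator, which save_answers() consumes.
    earnings_calls = load_earnings_call()
    answers = QuestionAnswerPipeline().call(earnings_calls)
    await save_answers(answers)

asyncio.run(run_pipeline())
```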
The input to the pipeline is provided by load_earnings_call(), which is an AsyncGenerator, because GenerationNode subclasses require their input as an AsyncGenerator.
lamini-examples/05_data_pipeline/generate_data.py, lines 121–127 (commit 70accea)
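Because only the first transcript line is processed, the loader can be as small as this sketch. The file path, the jsonlines dependency, and the PromptObject import path are assumptions.

```python
import itertools

import jsonlines
from lamini.generation.base_prompt_object import PromptObject  # import path may differ

async def load_earnings_call():
    # Placeholder path; point this at your own transcripts file.
    path = "data/earnings_call_transcripts.jsonl"
    with jsonlines.open(path) as reader:
        # Only the first transcript line is processed in this example.
        for line in itertools.islice(reader, 1):
            # The prompt string is filled in later by QuestionGenerator.preprocess();
            # here we only attach the raw transcript record as data.
            yield PromptObject(prompt="", data=line)
```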
The first stage reads a passage from an earnings call and asks the LLM to generate three questions about it. This is achieved by the prompt built in make_prompt() (line 79) together with the output_type handled in postprocess(), which force the model to generate exactly three questions and parse them automatically:
lamini-examples/05_data_pipeline/generate_data.py, lines 52–83 (commit 70accea)
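The exact prompt wording lives in generate_data.py; the sketch below only shows the shape of the two pieces. The data field names (company, transcript) and the output_type keys are assumptions.

```python
# A structured output_type like this makes the model return three named
# string fields instead of free-form text (keys here are illustrative).
THREE_QUESTIONS = {
    "question_1": "str",
    "question_2": "str",
    "question_3": "str",
}

def make_prompt(self, obj):
    # obj.data holds one chunk of an earnings call transcript.
    return (
        "You are a financial analyst. Read this passage from an earnings call "
        f"for {obj.data['company']} and write three insightful questions that "
        "can be answered using only the passage.\n\n"
        f"{obj.data['transcript']}"
    )
```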
You can define your own preprocess() to transform a GenerationNode's PromptObject before it is passed to the remote LLM inference API, and your own postprocess() to transform the result returned by that API. In this example, QuestionGenerator defines both preprocess() and postprocess():
lamini-examples/05_data_pipeline/generate_data.py, lines 48–61 (commit 70accea)
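Conceptually, preprocess() fills in the prompt just before the RPC is issued, and postprocess() fans the structured response out into one PromptObject per question for the next stage. A hedged sketch of that shape (the response keys and data fields are assumptions):

```python
def preprocess(self, obj: PromptObject):
    # Build the analyst prompt from the transcript data attached to this object.
    obj.prompt = self.make_prompt(obj)

def postprocess(self, obj: PromptObject):
    # The structured response carries three questions; emit one PromptObject
    # per question so the answer stage can handle each one independently.
    for key in ("question_1", "question_2", "question_3"):
        question = obj.response[key]
        yield PromptObject(
            prompt=question,
            data=dict(obj.data, question=question),
        )
```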
The answer generator is similar, just with a different prompt. You can control it by editing the prompt.
lamini-examples/05_data_pipeline/generate_data.py, lines 85–118 (commit 70accea)
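For reference, the answer stage's prompt has roughly this shape: it keeps the original transcript chunk alongside the generated question so the model answers from the source text (again, the field names are assumptions):

```python
def make_prompt(self, obj):
    # obj.data still carries the transcript chunk, plus the question that the
    # previous stage attached in its postprocess().
    return (
        "You are a financial analyst. Answer the question using only the "
        "following passage from an earnings call.\n\n"
        f"Passage: {obj.data['transcript']}\n\n"
        f"Question: {obj.data['question']}"
    )
```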
The output of the final GenerationNode is an AsyncGenerator that should be saved somewhere. This is done in save_answers(), which uses async for to iterate through the results and write them to an output file.
lamini-examples/05_data_pipeline/generate_data.py, lines 129–145 (commit 70accea)
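A minimal sketch of that consumer, assuming a jsonlines output file and the company/question/answer fields shown in the sample output above (the exact paths and response keys in generate_data.py may differ):

```python
import jsonlines

async def save_answers(answers):
    # Placeholder output path.
    with jsonlines.open("generated_q_a.jsonl", mode="w") as writer:
        # The pipeline output is an AsyncGenerator, so consume it with `async for`.
        async for answer in answers:
            writer.write(
                {
                    "company": answer.data["company"],
                    "question": answer.data["question"],
                    "answer": answer.response["answer"],  # key depends on the output_type used
                }
            )
```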