Welcome! You're here because your GenAI project is stuck, and you want to get it back on track.
Building with large language models (LLMs) is fast, but gaining transparency into your application is hard. It's tough to know where it excels, how to avoid critical errors, what to improve next, and when it will be ready.
That’s where Performance-Driven Development (PDD) comes in. PDD gives you the clarity you need to confidently manage and deploy GenAI applications.
GenAI Challenge | Your Current Situation | Better With PDD |
---|---|---|
Performance | You rely on general observations from demos and "iterate by feel" | You can quantify performance and optimize based on measurable limitations |
Cost and Latency | You can only provide general estimates, optimized at the app level | You forecast cost and speed, and optimize both for users and tasks |
Robustness | You experience inadvertent breaks or degradation during improvements | You ensure systematic improvements with consistent performance |
Schedule | You face unpredictable schedules due to continuous experimentation | You follow a predictable schedule based on ongoing system improvements |
I started this project based on Prolego's experiences helping clients with AI projects. More background.
--Kevin Dewalt. Contact me on LinkedIn or [email protected]
Large language models (LLMs) allow rapid development of powerful solutions, but they often produce inconsistent and unpredictable results. You can quickly create a proof-of-concept that mostly works, but pinpointing successes, failures, improvements, and project completion can be challenging. This lack of transparency frustrates you and your leadership. Here's how to get your LLM project on track.
Separate your AI solution—the components interfacing with LLMs—from the rest of your infrastructure. Assign a team of AI systems engineers to work on it, as shown in Figure Q1.
Figure Q1 - Segment the AI solution from the rest of your infrastructure through interfaces (e.g., JSON specs).
Your AI systems engineers will build the AI solution using a performance evaluation framework, which provides transparency into your LLM project. This framework includes four parts, as shown in Figure Q2:
- A representative set of data and tasks covering the scenarios the LLM will encounter.
- The AI solution—interfaces to LLMs, prompts, tools, agents, orchestrations, and data preparation, forming the deployable solution from Figure Q1.
- An evaluation workflow that generates performance results based on your data and tasks.
- A performance report showing how well your solution works, its cost, and its speed.
Figure Q2 - Create transparency with a performance evaluation framework.
Use the performance report to gather customer feedback and iteratively improve your framework. Deploy your AI solution when it meets the necessary performance standards.
This approach, called Performance-Driven Development (PDD), focuses on getting the LLM to perform as needed.
While PDD may seem complex, it’s simply a different way to approach LLM-powered projects. You can build your first performance evaluation framework in 15 minutes using generated data and spreadsheets.
Your leadership has finally approved funding for your first generative AI project. You gather data and begin building your proof of concept (POC). Here’s what to expect over the next 3-6 months:
- Rapid early results will excite customers, but progress will slow as engineers disagree on the best path for improvements.
- Stakeholders will ask, “How do you know it works?” and “When will you be done?”—and you won’t have clear answers.
- A team member will unintentionally degrade the solution, leading to weeks of troubleshooting. Progress will slow further as the team works to prevent similar issues.
- Customers will lose interest due to a lack of visible progress, and stakeholders will grow impatient as you struggle to explain where the system stands and when it will be completed.
And then … nothing good happens for you or the project because you’re stuck.
Unfortunately, even experienced software engineers and data scientists find themselves stuck because traditional methodologies and tools don’t address the challenges of GenAI.
Traditional software development is notoriously challenging. Business logic must be embedded in arcane languages like C or Python to instruct computers on how to perform tasks. This code doesn’t generalize well, and even the smallest error can render a system useless. Over the past 50 years, we’ve developed methodologies, tools, and frameworks to make this process more efficient. Agile and Lean Startup methodologies help us avoid building products nobody wants, while frameworks like Ruby on Rails reduce the amount of code we need to write.
Generative AI, particularly through large language models (LLMs), offers a more efficient way to build software because it has knowledge of the world and can complete tasks with general direction.
For example, I can instruct an LLM to "remove the comments from this Python script," and it will do so flawlessly because it understands what a Python comment is. This is far more efficient than writing 20 lines of Python to search for '#' characters and carefully delete them without affecting the rest of the code (for example, a '#' inside a string literal).
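To make that concrete, here is a minimal sketch of delegating the task to an LLM rather than hand-writing parsing logic. It assumes the OpenAI Python client, an API key in the environment, and a placeholder model name; any chat-capable model would work similarly.

```python
# Minimal sketch: hand the task to an LLM with general direction instead of writing
# parsing code. Assumes the OpenAI Python client and an API key in the environment;
# the model name is a placeholder, not a recommendation.
from openai import OpenAI

client = OpenAI()

with open("my_script.py") as f:
    script = f.read()

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: substitute whichever model you use
    messages=[{
        "role": "user",
        "content": "Remove the comments from this Python script and return only the code:\n\n" + script,
    }],
)

print(response.choices[0].message.content)
```

The business logic lives in the instruction, not in the code.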
Generative AI will ultimately enable smaller teams to build more powerful software in less time. However, this power comes with a trade-off: LLMs are stochastic. For any given task, an LLM may produce different results each time or even hallucinate—generating convincing responses that are incorrect.
Table 1 highlights ways GenAI projects are different.
Attribute | Traditional Software | GenAI |
---|---|---|
Developer Interface | Compiler or interpreter | LLM API |
Interface Behavior | Deterministic | Stochastic |
Primary Risk | Market / Adoption | Technical |
Solution Logic | Written in code | Inherent in LLM |
Rate of Change | Linear | Exponential |
Frameworks / Practices | Mature | Emerging |
Iteration Cycle Time | Weeks (Sprints) | Hours |
Table 1 - LLMs are a new programming interface with different risks than traditional software.
You’re stuck because you’re trying to apply traditional software product management and engineering principles to GenAI projects. Here is how these differences manifest:
In traditional software engineering, unit, functional, and integration tests ensure that deterministic functions produce specific outputs from specific inputs. While you still need tests in GenAI to prevent bugs in your Python libraries, a different approach is required to ensure LLMs perform as desired.
Just as a developer can write a software function in many ways, LLMs can produce correct results through various approaches. Moreover, LLM outputs don’t conform to simple pass/fail evaluation criteria. An LLM can generate a result that is mostly correct and still effectively solve the business problem. Conversely, slight wording or formatting changes in a prompt that appear identical to a human evaluator can cause a system to fail catastrophically.
Best practices and tools in traditional software engineering don’t evolve as rapidly. Whether you build your product with Ruby on Rails Version 6 or 7 won’t significantly impact its success. A great programmer using a 10-year-old tech stack will still outperform an average programmer using modern tools.
In contrast, the difference between LLM versions, like GPT-3.5 and GPT-4, is so significant that it could either improve your product by 90% or render it obsolete. No level of developer skill can bridge the gap between generating text with GPT-2 and GPT-4.
Providing transparency in traditional software projects is challenging, which is why the industry has developed methodologies like Agile, tools like Jira, and roles like Scrum Master to offer some degree of predictability. Even with these, it remains difficult for project managers to predict when software projects will be completed and how well they will perform. GenAI projects are even more complex.
With no clear way to predict how an LLM will perform, teams struggle to describe how well their solution is working. Unlike traditional software, you can't predict when a GenAI project will be complete because you don’t know what obstacles you’ll face or how you'll overcome them. Additionally, it's difficult to quantify the effectiveness or improvement of your GenAI project.
Here are four steps for getting your GenAI project on track.
- Logically Isolate AI in Your Systems
- Create a Team of AI Systems Engineers
- Build a Performance Evaluation Framework
- Iteratively Optimize Your Solution with Customers
Isolating AI components from the rest of your system’s architecture enables team specialization, scalable and consistent AI capabilities, faster adoption of new models, and transparency.
Many teams begin their GenAI projects by adding LLM API calls within their existing code libraries, as shown in Figure PD1.
Figure PD1 - You can begin a GenAI project by adding LLM API calls within your existing data workflow and applications. While this design works for simple, low-risk tasks, it doesn’t provide the transparency and predictability needed for most workflows.
While this is a quick way to start and works for simple, low-risk tasks, it creates scaling challenges:
- It obscures what the LLM is doing.
- It prevents rapid testing and integration of new LLMs and approaches.
- It creates code redundancy.
A more effective approach is to isolate the AI components of your system—such as LLM interfaces, prompts, prompt orchestration agents, and tools—from the rest of the system's architecture, as illustrated in Figure PD2.
Figure PD2 - Logically isolate your AI solution from the rest of your system. The AI solution consists of components that optimize the LLM’s performance, such as prompts, data preprocessing, prompt orchestrations, and tools.
This design has several scaling advantages:
- It allows you to build a team of AI systems engineering specialists.
- It lets you offer similar AI capabilities, such as text processing, across the company.
- It enables quicker adoption of new LLMs or approaches.
Finally, it allows you to create transparency by building an LLM performance evaluation framework.
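As a rough sketch of what that isolation can look like in practice, the rest of the system can call the AI solution through a single entry point that exchanges JSON-style payloads. The function and field names below are illustrative assumptions, not a prescribed interface.

```python
# Illustrative sketch of a JSON-style contract between the application and the AI solution.
# The function and field names are assumptions for this example, not a prescribed API.

def run_ai_solution(request: dict) -> dict:
    """Single entry point the rest of the system calls.

    Everything behind this function -- prompts, retrieval, orchestration, model
    choice -- belongs to the AI solution and can change without touching callers.
    """
    task = request["task"]        # e.g., "answer_question"
    payload = request["payload"]  # e.g., {"question": "..."}

    # ...prompts, retrieval, LLM calls, and post-processing happen here...
    answer = f"placeholder answer for task '{task}' and question '{payload.get('question', '')}'"

    return {"task": task, "answer": answer, "confidence": "medium"}


# Example call from the application side:
result = run_ai_solution({
    "task": "answer_question",
    "payload": {"question": "Are driver salaries included in the cost cap?"},
})
print(result["answer"])
```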
At best, frameworks like LangChain make it slightly easier to solve simple problems. Unfortunately, they further obscure what your LLM is doing and reduce transparency. We have yet to encounter an experienced team that recommends them for production.
Building solutions with LLMs isn’t traditional data science, programming, or machine learning. LLMs are a new programming interface, and it takes 6-12 months to become proficient. We call this role an AI systems engineer, combining a data scientist’s experimental mindset with a systems engineer’s big-picture view.
Here is an AI systems engineer job description and a video, The Surprising Skills You Need to Build LLM Applications.
Engineers from data science, software engineering, or systems engineering backgrounds can transition into this role with time and dedication. Even product and project managers with minimal technical experience are successfully building LLM-based solutions.
AI systems engineers will build the components that optimize the LLM’s performance, as shown in Figure PD2 above, through a performance evaluation framework.
You need a new approach for building solutions with LLMs, one that provides granular visibility into how well your solution is working. A performance evaluation framework, as illustrated in Figure PD3, provides this transparency.
Figure PD3 - A performance evaluation framework creates transparency into how the LLM is performing. It consists of (1) data and tasks, (2) AI solution, (3) evaluation workflow, and (4) performance report.
The framework consists of four components:
- A representative set of data and tasks covering the scope of the problem.
- The AI solution you will deploy.
- An evaluation workflow that runs the representative set of data through your solution and evaluates its performance on the tasks.
- A performance report showing how well your solution works and calculating key metrics.
Let's walk through each.
The foundation of your framework is a representative set of data and tasks tailored to the solution you’re building. Table 2 lists examples for common LLM solutions.
LLM Solution | Data | Tasks |
---|---|---|
RAG | Source documents | Questions and expected answers |
Document classification | Set of documents | Expected classification of each |
LLM to SQL | Populated database tables | Questions and expected results from SQL queries |
Table 2 - Examples of performance evaluation framework data and tasks designed to provide granular transparency.
This set should cover the full scope of the problem space and be used to evaluate the solution's effectiveness. If hallucinations or mistakes are concerns, create tasks specifically designed to test how the solution handles these issues.
Continuously update and revise this set as long as the solution is in production, and generate new tasks when adding data, changing tasks, or encountering operational anomalies. When legal concerns arise, such as "what if x occurs...," create a task to demonstrate how the solution addresses it.
Developing a representative set requires significant experience, experimentation, and iteration. The key challenge is determining the right quantity and type of tasks to cover common scenarios and critical edge cases without generating redundant tasks that don't improve system robustness.
Some teams also use generated data to mitigate security and legal risks, enabling faster and more efficient testing of new LLMs.
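To make the examples in Table 2 concrete, here is a minimal sketch of what a few representative tasks might look like. The field names and entries are illustrative assumptions, not a required schema; real sets are built iteratively with domain experts.

```python
# Illustrative sketch of representative tasks for the solution types in Table 2.
# Field names and entries are assumptions, not a required schema.

rag_tasks = [
    {"question": "Are driver salaries included in the cost cap?",
     "expected_answer": "No, driver salaries are excluded from the cost cap."},
    # An edge case designed to probe hallucination: the answer is not in the source documents.
    {"question": "What is the cost cap for a junior karting team?",
     "expected_answer": "The regulations do not define a cost cap for karting teams."},
]

classification_tasks = [
    {"document": "supplier_agreement_2023.pdf", "expected_label": "contract"},
    {"document": "q3_budget_update.docx", "expected_label": "internal_memo"},
]

llm_to_sql_tasks = [
    {"question": "How many orders shipped in March 2024?",
     "expected_result": 1284},  # expected result of running the generated SQL
]
```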
The AI solution is the software that transforms the input data into an output by interacting with the LLM:
- Prompts
- Data processing scripts
- Interfaces
- Prompt orchestration libraries
- Agents
- Tools
The AI solution is what you will deploy into production as discussed previously in Create Interfaces with Your AI Solution.
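Sketched in code, a minimal version of such an AI solution might compose a prompt template, a retrieval step, and an LLM call into one deployable function. Everything here (names, prompt wording, model) is an assumption, not a prescribed design.

```python
# Minimal sketch of an AI solution: a prompt template, a retrieval stub, and an LLM call
# composed into one deployable function. Names, prompt wording, and model are assumptions.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)

def retrieve_context(question: str) -> str:
    # Placeholder for your retrieval step (vector search, keyword search, etc.).
    return "...relevant passages from your source documents..."

def answer_question(question: str, model: str = "gpt-4o") -> str:
    context = retrieve_context(question)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    response = client.chat.completions.create(
        model=model,  # assumption: use whichever model you are evaluating
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```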
You need a scalable way to measure how well your system is performing by running the representative tasks and data sets through your solution and comparing the expected vs. actual outputs.
Table 3 lists three common options teams are currently using.
Evaluation Option | Description | Pro | Con |
---|---|---|---|
Manual inspection | Developer visually inspects expected and actual outputs | Easiest to start. Most likely way to ensure accuracy. | Slow. Won’t scale. |
Scripts | Python scripts detecting presence or absence of words | Relatively easy to set up. Good at catching obvious errors. Runs fast. | Tedious to maintain. Can miss edge cases. |
LLMs | Send the expected and actual results to an LLM with instructions to evaluate. | Accuracy of manual inspection at scale. | LLMs are still not very good at it. Can get expensive. |
Table 3 - Three techniques for evaluating your AI solution’s performance.
OpenAI and others have advocated for using LLMs for your evaluation. In practice, most teams have found the combination of human review, Python scripts, and LLMs to be necessary.
Additionally, your evaluation workflow will gather key system metrics such as cost, token count, and latency. It must also provide transparency into the LLM's sensitivity, for example by running the LLM at different temperatures to generate a confidence interval.
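As a rough sketch of what one evaluation step might look like under these assumptions, the following combines a cheap script check with an LLM-as-judge call and records latency. The judging prompt, scoring scale, and field names are all assumptions, and `solve` stands in for the solution under test (for example, the answer_question sketch above).

```python
# Rough sketch of evaluating a single task: a cheap script check, an LLM-as-judge call,
# and a latency measurement. The judging prompt, scoring scale, and field names are
# assumptions; cost and token counts can be read from the provider's usage fields.
import time
from openai import OpenAI

client = OpenAI()

def keyword_check(required_keywords: list[str], actual: str) -> bool:
    # Script-based check: did the phrases that must appear actually appear?
    return all(kw.lower() in actual.lower() for kw in required_keywords)

def llm_judge(question: str, expected: str, actual: str, model: str = "gpt-4o") -> str:
    prompt = (
        "You are evaluating an AI assistant's answer.\n"
        f"Question: {question}\nExpected answer: {expected}\nActual answer: {actual}\n"
        "Reply with one word -- LOW, MEDIUM, or HIGH -- for your confidence that the "
        "actual answer is correct and complete."
    )
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

def evaluate_task(task: dict, solve) -> dict:
    """`solve` is the AI solution under test, e.g. the answer_question sketch above."""
    start = time.time()
    actual = solve(task["question"])
    elapsed = round(time.time() - start, 2)
    return {
        "question": task["question"],
        "actual": actual,
        "script_check": keyword_check(task.get("required_keywords", []), actual),
        "llm_confidence": llm_judge(task["question"], task["expected_answer"], actual),
        "elapsed_seconds": elapsed,
    }
```

Running the same task several times, or at different temperatures, turns the single confidence value into a rough confidence interval.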
Finally, your framework will have a performance evaluation report that provides transparency into your solution. The report should offer granular performance visibility into every task in your set and higher-level solution metrics.
The evaluation report typically contains the following for every task:
- A confidence score comparing the actual vs. expected performance by the LLM.
- The specific LLMs used (e.g., GPT-4, LLama3-8b).
- Cost or token count.
- Latency.
- Notes or recommendations for improving performance.
Figure PD3 has a simple example.
Spreadsheets are perfectly fine for performance evaluation reports. You can import CSV files generated by Python scripts as new tabs, and both engineers and customers can quickly do analysis and provide feedback.
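For example, a few lines of pandas (plus openpyxl for .xlsx output) are enough to turn each week's results into a new tab. The column names below simply mirror the report described above and are not a required schema.

```python
# Sketch of writing weekly evaluation results into a spreadsheet, one tab per run.
# Assumes pandas and openpyxl; column names mirror the report described above.
import pandas as pd

weekly_results = [  # one dict per task, as produced by your evaluation workflow
    {"Task": "F1 QA", "Question": "Are race driver salaries included in the cost cap?",
     "Model": "gpt-3.5-turbo-0125", "Confidence": 3, "Elapsed Time": 1.21},
]

with pd.ExcelWriter("performance_report.xlsx") as writer:
    pd.DataFrame(weekly_results).to_excel(writer, sheet_name="2024-07-24", index=False)
    # Each week, add the new run as another sheet and keep a Summary tab of run-level metrics.
```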
Figure PD4 illustrates the conceptual workflow for leveraging PDD in your existing processes.
Figure PD4 - Your AI team needs to operate independently and at a faster cycle time than the rest of your team. During a typical two-week sprint, an AI systems engineer will make dozens of improvements to their performance framework, and more powerful LLMs will be released. You can still deploy the AI solution according to your existing workflow.
Isolating the AI solution from the rest of the system architecture allows AI systems engineers to work at a faster pace and focus on the risks associated with stochastic behavior.
1. Build your performance evaluation framework:
   - Generate a representative set of data and tasks for the LLM.
   - Create the AI solution to address these tasks.
   - Evaluate how well your solution meets expectations.
   - Generate a performance report.
2. Review your performance report with your customer and gather feedback.
3. Repeat steps 1 and 2 until your solution performs adequately.
4. Create interfaces between your AI solution and the rest of your system workflow.
5. Deploy your solution through your existing processes.
Continue this process, offloading more work to the LLM as AI improves.
AI is improving at an exponential rate, and tomorrow’s LLMs will be better at solving your problems. Teams have wasted months optimizing solutions for LLM limitations like speed or reasoning power, only to discard this work when better LLMs and hardware are released. Prioritize tasks the LLM can perform without customization before tackling harder ones.
Of the available 13 LLM optimization techniques, begin with simpler ones like optimizing prompts or testing different LLMs. Only pursue complex solutions like fine-tuning or agents when easier approaches are insufficient.
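Continuing the sketches above, testing a different LLM can be as simple as re-running the same evaluation set against each candidate model and comparing the resulting reports. The model names here are placeholders, and evaluate_task, answer_question, and rag_tasks refer to the earlier sketches.

```python
# Sketch of one of the simpler optimizations: rerun the same evaluation set against
# several candidate models and compare the reports. Model names are placeholders, and
# evaluate_task / answer_question / rag_tasks refer to the sketches above.
from functools import partial

candidate_models = ["gpt-3.5-turbo", "gpt-4o"]

for model in candidate_models:
    solve = partial(answer_question, model=model)
    results = [evaluate_task(task, solve) for task in rag_tasks]
    # Write one report per model, then compare confidence, cost, and latency side by side.
```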
Let’s walk through a simple example. The source code, installation instructions, and reports are in the Example-RAG-Formula-1 directory.
In Episode 17 of our Generative AI Series, we demonstrated an LLM Retrieval Augmented Generation (RAG) chat application built on the regulations issued by Formula 1's governing body, the Fédération Internationale de l'Automobile (FIA). We also:
- held a demo and detailed technical discussion,
- published a study on the importance of context for RAG, and
- shared the results on a Webinar with Pinecone.
The application allows fans to ask detailed questions about the sport’s rules. Figure E0 is a screenshot.
Figure E0 - A screenshot of the FIA Regulation Search RAG application, which answers user questions about the sport's complex rules.
The solution follows the basic RAG design as shown in Figure E1. The Documents are the Formula 1 rules located here along with additional context as described in the study.
Figure E1 - A basic RAG workflow for unstructured text documents.
Since we cover the solution and optimizations in the links above, I'll jump to the PDD methodology.
The Performance Reports Folder provides details and this summary table:
Date | Total tasks | Average Confidence | Average input tokens | Average output tokens | Average Time [s] | Notes and Recommendations |
---|---|---|---|---|---|---|
7/24/24 | 10 | MEDIUM | 54 | 77 | 1.64 | Simple QA with gpt-3.5 |
7/31/24 | 25 | LOW | 62 | 86 | 1.83 | Inaccuracies due to model |
8/7/24 | 25 | MEDIUM | 62 | 295 | 9.85 | Simple QA with gpt-4.0 |
8/14/24 | 25 | MEDIUM-HIGH | 1497 | 97 | 3.74 | Overall improved with gpt-4 |
8/21/24 | 25 | HIGH | 1855 | 130 | 5.87 | High accuracy on all but 1 |
This level of summary is ideal for stakeholder communication. It describes how the system is evolving over time based on the number of tasks (questions) and how well it is performing.
Download the Excel spreadsheet. The Summary tab contains the same table as above, while each weekly tab contains the performance details for that week, as shown in Figure E2.
Figure E2 - Each tab in the Performance Report spreadsheet contains the results for that week.
Here is an explanation of the first row in the 7-24 tab. (I transposed the table for easier reading).
Column | First Row | Explanation |
---|---|---|
Task | F1 QA | The task we're asking the LLM to perform: Answer user questions on F1 Rules. Other possible tasks include error-checking or guardrails. |
Question | Are race driver salaries included in the cost cap? | An example question asked by the user. |
Context | In calculating Relevant Costs, the following costs and amounts within Total Costs of the Reporting Group must be excluded (“Excluded Costs”): All costs of Consideration provided to an F1 Driver, or to a Connected Party of that F1 Driver providing the services of an F1 Driver to or for the benefit of the F1 Team, together with all travel and accommodation costs in respect of each F1 Driver. | The information from the source documents (the F1 rules) required to correctly answer the question. Important for troubleshooting. If you're not providing the correct context to the LLM, your limitation is in retrieval. |
Model | gpt-3.5-turbo-0125 | The model used to generate the response. |
Input Tokens | 48 | Number of tokens (roughly, word pieces) used as input to the model. |
Generated Tokens | 49 | Number of tokens the model generated in response. |
Elapsed Time | 1.21 | Time, in seconds, taken by the model to generate the response. |
Expected | No, F1 driver salaries are not included in the Cost Cap. F1 driver salaries are exempt. | The correct answer to the user question. Human-derived from the correct context. |
Actual | No, race driver salaries are not included in the cost cap for Formula 1 teams. The cost cap primarily focuses on the team’s expenditure related to designing, developing, and running the cars. Driver salaries are separate from the cost cap regulations. | The system-generated response to the question. Compare to the Expected and generate Eval notes and Confidence. |
Confidence | 3 | A subjective score of the model's overall performance on the task, in this case 1, 2, or 3 (Low, Medium, High). Can be generated via Python scripts, human review, or LLMs. |
Eval Notes | Explanation: The actual response accurately conveys that race driver salaries are not included in the cost cap for Formula 1 teams and provides additional context about what the cost cap covers, which aligns with the expected response. | Subjective evaluation of the model's performance on the task. Can be LLM-generated with additional developer notes. |
Now that you've had a chance to review a performance report, let's walk through the steps for generating one.
You start by building a spreadsheet of expected questions and correct answers:
Question | Expected Answer |
---|---|
How big must the rear view mirrors be? | The reflective surface of rear view mirrors must be 200x50 mm. |
Are there times when a driver is forced to be medically evaluated following an on-track incident? | Yes. If the Medical Warning Light is illuminated, signaling that threshold forces have been exceeded, then a Medical Delegate must examine the driver as soon as possible. |
You can also gather correct Context from the source documents and any other information relevant to your system design.
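Stored as data, each row of that spreadsheet might become an entry like the one below. The actual eval_set.json in the repository may use different field names, so treat this as an illustrative assumption.

```python
# Roughly what one entry of the evaluation set might look like once the spreadsheet rows
# and gathered context are stored as data. The real eval_set.json fields may differ.
example_entry = {
    "question": "How big must the rear view mirrors be?",
    "expected_answer": "The reflective surface of rear view mirrors must be 200x50 mm.",
    "context": "...the passage from the FIA regulations that specifies mirror dimensions...",
}
```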
You then build scripts and configuration files to do the following:
- Send the questions to your solution.
- Generate an actual answer.
- Calculate key evaluation metrics like cost, speed and retrieved context.
- Store the results in a spreadsheet or other human-readable format.
You can start by doing this manually before automating it with scripts and LLMs. Check out these files:
- Evaluation Set (JSON) at eval_set.json. Evaluation questions, expected answers, and correct context in a JSON file.
- Evaluation Script (Python) at eval.py. Master script that imports the relevant libraries, generates results from the AI solution, and stores them in Excel format (a rough sketch of a similar workflow follows below).
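As a hedged sketch (not the actual eval.py), the whole workflow can be as small as the following, assuming the entry format shown earlier and a `solve` callable such as the answer_question sketch above.

```python
# Minimal sketch of the evaluation workflow -- not the actual eval.py. Assumes the
# illustrative entry format shown earlier and pandas/openpyxl for the report.
import json
import time
import pandas as pd

def run_evaluation(eval_set_path: str, solve, report_path: str = "report.xlsx") -> None:
    with open(eval_set_path) as f:
        tasks = json.load(f)

    rows = []
    for task in tasks:
        start = time.time()
        actual = solve(task["question"])             # 1-2. send the question, get an answer
        elapsed = round(time.time() - start, 2)      # 3. capture a key metric (latency)
        rows.append({
            "Question": task["question"],
            "Expected": task["expected_answer"],
            "Actual": actual,
            "Elapsed Time": elapsed,
            "Confidence": "",                        # filled in by scripts, an LLM judge, or a human
        })

    pd.DataFrame(rows).to_excel(report_path, index=False)  # 4. human-readable report

# run_evaluation("eval_set.json", solve=answer_question)
```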
Once you build your basic evaluation workflow, you can begin iteratively improving your solution: adding questions, reviewing your results, identifying limitations, and making improvements.
You can also continuously improve your evaluation workflow and scripts to automatically generate results. The performance framework allows you to make changes with confidence.
PDD has effectively addressed your primary challenges:
- You now have transparency into where your solution is performing well and where it’s falling short.
- You have a method for managing the stochastic behavior of LLMs.
- You can focus on the highest-impact improvements rather than relying on trial and error.
- You can detect potential issues in your solution early.
Additionally, you’re able to demonstrate consistent progress, provide clear transparency, and estimate when your solution will be ready for production.
by Kevin Dewalt
I’ve been programming computers for over 40 years, and the arrival of generative AI marks the most significant shift in software engineering I’ve ever seen. Large Language Models (LLMs) are set to revolutionize the field, enabling smaller teams to build more powerful solutions at unprecedented speeds, while drastically reducing development costs. This efficiency will allow us to create AI-powered applications that not only outperform current alternatives but also tackle increasingly complex problems.
We’re standing on the brink of a new golden era for software engineering, but two major obstacles stand in our way. First, AI is still in its infancy, and LLMs are not yet adept at solving most business problems. Extensive customization is required to enable LLMs to access structured data or perform beyond basic reasoning. The good news is that LLMs are improving at an exponential rate, and we can expect many of these challenges to be resolved within the next two years.
The second challenge is that we lack effective methodologies for developing solutions with LLMs. Many teams are trying to repurpose existing tools and approaches, with mixed results. Application frameworks like Ruby on Rails and Django revolutionized data-driven web development, but frameworks like LangChain are currently hindering efficiency. Similarly, product strategies like Agile and Lean Startup, which are designed to mitigate adoption and market risks, are distracting teams from the more pressing challenge of getting LLMs to perform as intended.
As a result, many LLM projects are stalled. Teams struggle with a lack of transparency into how their solutions are functioning, and they lack the tools to determine whether they should refine their models through fine-tuning or by implementing more complex prompt orchestration. They can’t predict when their solution will be good enough or how much it will cost. Even worse, many are investing time and resources into building software that will be obsolete within two years.
Fortunately, the most effective engineering teams are pioneering a new approach, which we call Performance-Driven Development (PDD). This methodology addresses the two central challenges of working with LLMs: (1) they behave stochastically, and (2) they are improving at an exponentially fast rate. PDD will help you get LLMs to perform as desired.
My journey with PDD began in April 2023, after reading Sparks of Artificial General Intelligence: Early experiments with GPT-4, by Sébastien Bubeck et al. It became immediately clear that software engineering as we knew it was about to undergo a permanent transformation. I began developing LLM-based solutions with Justin Pounders and other engineers at Prolego, sharing our findings through open-source examples and weekly YouTube videos. We also engaged with numerous teams and reviewed case studies from companies that were seeing early success.
Over time, it became clear that much of my previous experience in software, machine learning, and data science didn’t apply to generative AI. I was struck by how quickly our team could build solutions and how little actual code was required. At the same time, I realized the importance of maintaining transparency into how LLMs were performing at the task level to avoid getting stuck in a perpetual cycle of “iterate by feel.” We also connected with other engineering teams who were converging on similar issues.
By April 2024, the key concepts of PDD had crystallized, and we began integrating performance evaluation frameworks into our client work. Partnering with our clients’ engineering teams, we were able to turn around stalled projects in a matter of weeks, providing the transparency and predictability they needed. PDD proved so effective that Russ and I decided to refocus Prolego entirely around this methodology.
This project is my attempt to document the PDD methodology, based on what’s actually working for those of us who make a living by delivering solutions that solve real problems. If that resonates with you, I hope this guide helps you as much as it’s helped us—and that together, we can continue to push the boundaries of what’s possible.
Please submit issues and pull requests of your suggestions or feedback. Or email me at [email protected].
LLMs are new, and unfortunately, separating the AI hypesters from the real practitioners isn't easy. Here are some pros whose work I admire and who helped make this document better.
Thank you Justin Pounders, Craig Dewalt, and Shanif Dhanani for reviewing the early, ugly drafts.
Copyright 2024, Prolego, Inc. All rights reserved.