Original Post: Does your LLM thing work? (& how we use promptfoo)
The AI team at Semgrep describes what they learned while shipping AI features built on prompt chains, and how they measure whether those features actually work. They identify three categories of quality metrics:
- Behavior Metrics: The outcomes that motivated the AI feature in the first place, such as decreasing the time to resolve security vulnerabilities. Because the AI is only one step in a larger workflow, isolating its impact is difficult and requires large sample sizes to observe a significant effect.
- Feedback Metrics: Explicit user feedback, often collected via thumbs-up/down buttons. Larger sample sizes and a balanced user interface can minimize bias, but feedback can still be swayed by initial user excitement or disapproval.
- Laboratory Metrics: Reproducible in-team evaluations such as test suites and scored outputs, providing a shorter feedback loop but requiring more test infrastructure.
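To make the laboratory category concrete, here is a minimal sketch of a scored test case. The scoring rule and the `score_fix_suggestion` helper are hypothetical illustrations, not Semgrep's actual evaluation code; the point is that the score is computed deterministically from stored cases, so the loop runs in seconds rather than waiting on user behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestCase:
    # A stored finding plus the answer a human reviewer would accept.
    finding: str
    vulnerable_code: str
    reference_fix: str


def score_fix_suggestion(model_output: str, case: TestCase) -> float:
    """Hypothetical laboratory metric: crude token overlap with a
    human-written reference fix, scored between 0 and 1."""
    reference_tokens = set(case.reference_fix.split())
    output_tokens = set(model_output.split())
    if not reference_tokens:
        return 0.0
    return len(reference_tokens & output_tokens) / len(reference_tokens)


# Example: score one model output against one stored case.
case = TestCase(
    finding="sql-injection",
    vulnerable_code='cursor.execute("SELECT * FROM users WHERE id = " + user_id)',
    reference_fix='cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))',
)
print(score_fix_suggestion('use cursor.execute("... WHERE id = %s", (user_id,))', case))
```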
To operationalize these metrics, the team stresses the importance of:
- Querying the LLM as production does.
- Comparing AI outputs with human-written benchmarks.
- Implementing rules ensuring template variables and prompt rendering are consistent, immutable, and serializable.
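One way to read the last point: treat the prompt template and its variables as plain, immutable data, so the exact prompt sent in production can be serialized, reloaded, and re-rendered byte-for-byte in tests. A minimal sketch, with an illustrative template and variable names rather than Semgrep's actual prompts:

```python
import json
from dataclasses import asdict, dataclass
from string import Template


@dataclass(frozen=True)  # immutable: logged inputs cannot drift after the fact
class PromptInput:
    rule_id: str
    code_snippet: str


TEMPLATE = Template(
    "Explain why the finding $rule_id fires on this code and suggest a fix:\n"
    "$code_snippet"
)


def render(inputs: PromptInput) -> str:
    # Rendering is a pure function of the serialized variables,
    # so production and tests produce identical prompts.
    return TEMPLATE.substitute(asdict(inputs))


inputs = PromptInput(rule_id="python.sql-injection", code_snippet="query = base + user_id")
record = json.dumps(asdict(inputs))           # what production would log
replayed = PromptInput(**json.loads(record))  # what the test suite reloads
assert render(inputs) == render(replayed)
```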
The team experienced difficulties with end-to-end testing and opted for unit testing to reduce infrastructure overhead and improve testing efficiency.
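A rough illustration of that unit-testing approach, under the assumption that each chain stage is a function taking a rendered prompt and an LLM completion callable: the callable can then be stubbed so a single stage is tested in isolation, offline and deterministically. The `triage_stage` function and its stubs below are hypothetical, not the article's code.

```python
import unittest
from typing import Callable


# Hypothetical chain stage: decide whether a finding is a true positive.
def triage_stage(prompt: str, complete: Callable[[str], str]) -> bool:
    answer = complete(prompt).strip().lower()
    return answer.startswith("true positive")


class TriageStageTest(unittest.TestCase):
    def test_true_positive_is_detected(self):
        # Stub the LLM so the test is fast, deterministic, and offline.
        stub = lambda prompt: "True positive: user input reaches the SQL query."
        self.assertTrue(triage_stage("...rendered production prompt...", stub))

    def test_false_positive_is_detected(self):
        stub = lambda prompt: "False positive: the value is a constant."
        self.assertFalse(triage_stage("...rendered production prompt...", stub))


if __name__ == "__main__":
    unittest.main()
```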
For their test setup, they recommend promptfoo, a community-driven test runner that manages prompts, variables, and providers. Their harness needs to be able to:
- Retrieve production-realistic prompts.
- Use a persistent database of test cases for static and dynamic features.
- Capture real user activities to simulate realistic scenarios.
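A sketch of the capture step, assuming production requests can be logged along with their template variables: each real interaction is appended to a small persistent store, and the stored variables can later be exported in a vars/expected shape that a test runner such as promptfoo (which reads prompts, providers, and test cases from its own configuration) could consume. The schema and function names here are illustrative.

```python
import json
import sqlite3

# Persistent store of captured production interactions (schema is illustrative).
conn = sqlite3.connect("llm_test_cases.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS test_cases ("
    "  id INTEGER PRIMARY KEY,"
    "  feature TEXT,"    # which AI feature produced the prompt
    "  variables TEXT,"  # JSON of the template variables, as logged in production
    "  expected TEXT)"   # optional human-written benchmark output
)


def capture(feature: str, variables: dict, expected: str | None = None) -> None:
    """Called from the production path (or a log replayer) to record a real case."""
    conn.execute(
        "INSERT INTO test_cases (feature, variables, expected) VALUES (?, ?, ?)",
        (feature, json.dumps(variables), expected),
    )
    conn.commit()


def export_cases(feature: str) -> list[dict]:
    """Dump stored cases in a vars/expected shape a test runner can consume."""
    rows = conn.execute(
        "SELECT variables, expected FROM test_cases WHERE feature = ?", (feature,)
    )
    return [{"vars": json.loads(v), "expected": e} for v, e in rows]


capture("triage", {"rule_id": "python.sql-injection", "code_snippet": "query = base + user_id"})
print(export_cases("triage"))
```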
They emphasize automating these processes so that new models, prompt texts, and context inclusions can be evaluated quickly and consistently.
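To make that concrete, a hedged sketch of the kind of sweep such automation might run; the model names, prompt variants, and `run_suite` placeholder stand in for whatever the real pipeline invokes.

```python
from itertools import product

# Placeholder dimensions; real runs would pull these from configuration.
MODELS = ["provider-a/model-small", "provider-b/model-large"]
PROMPT_VARIANTS = ["baseline", "with-code-context", "with-rule-docs"]


def run_suite(model: str, prompt_variant: str) -> float:
    # Stand-in for invoking the test runner against the stored cases
    # and returning an aggregate laboratory score; hardcoded here.
    return 0.0


def evaluate_matrix() -> dict[tuple[str, str], float]:
    # Same stored test cases for every model x prompt-variant combination,
    # so trying a new model or prompt edit is a one-line config change.
    return {
        (model, variant): run_suite(model, variant)
        for model, variant in product(MODELS, PROMPT_VARIANTS)
    }


print(evaluate_matrix())
```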
The article concludes with a mention of further features under development, hinting at utilizing model consensus and log probabilities for more sophisticated evaluations.
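The article does not spell out how consensus or log probabilities would be used; the following is one common pattern, shown only as a speculative sketch: majority vote across several sampled answers for agreement, and the exponentiated mean token log probability as a rough confidence signal.

```python
import math
from collections import Counter


def consensus(answers: list[str]) -> tuple[str, float]:
    """Majority vote over several sampled answers; returns the winner and
    its share of the vote as a rough agreement score."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)


def mean_logprob_confidence(token_logprobs: list[float]) -> float:
    """Turn per-token log probabilities into a 0-1 confidence-like number
    by exponentiating the average token logprob."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Example: three samples agree 2-to-1, and the winning sample's tokens
# were generated with fairly high probability.
answers = ["true positive", "true positive", "false positive"]
print(consensus(answers))                            # ('true positive', 0.66...)
print(mean_logprob_confidence([-0.1, -0.3, -0.05]))  # ~0.86
```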
Go here to read the Original Post