
Key Ideas

In their rawest form, AI Agents have the following structure:

model = "gpt-3.5-turbo" # Model
model_endpoint = "https://api.openai.com/v1/chat/completions" # just one endpoint
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}
# payload may vary from LLM Organisation but it is a text string.
payload = {
   "model": model,
   "messages": [
       {"role": "system", "content": system_prompt},
       {"role": "user", "content": user_prompt},
   ],
   "stream": False,
   "temperature": temperature, 
}
# Use HTTP POST method
response = requests.post(
   url=model_endpoint, # The endpoint we are sending the request to. Low Temperature:
   headers=headers, # Headers for authentication etc
   data=json.dumps(payload) # Inputs, Context, Instructuins
).json()

We can therefore log INPUTS and OUTPUTS as needed; the response contains a wealth of information (the generated message, token usage, and model metadata).

This gives us full observability - in fact the maximum possible - and we can then proceed to EVALS.
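
For example, every call can be logged as one record per request/response pair. A minimal sketch, assuming a JSONL log file; the helper name log_llm_call and the llm_calls.jsonl path are illustrative, not part of the original code:

import json
import time

def log_llm_call(payload: dict, response: dict, path: str = "llm_calls.jsonl") -> None:
    # Append one request/response pair as a single JSON line.
    record = {
        "timestamp": time.time(),
        "inputs": payload,    # model, messages, temperature, ...
        "outputs": response,  # the full provider response, including usage statistics
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# After each call to the model:
# log_llm_call(payload, response)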

  1. LLM evals ≠ benchmarking.

  2. LLM evals are a tool, not a task.

  3. LLM evals ≠ software testing.

  4. Manual + automated evals.

  5. Use reference-based and reference-free evals (see the sketch after this list).

  6. Think in datasets, not unit tests.

  7. LLM-as-a-judge is a key method. LLM judges scale human evals rather than replace them.

  8. Use custom criteria, not generic metrics.

  9. Start with analytics.

  10. Evaluation is a moat.
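
To make points 5 and 6 concrete, here is a minimal sketch of a reference-based check (exact match against an expected answer) and a reference-free check (a simple conciseness test), run over a small eval dataset. The function names and the tiny dataset are illustrative assumptions, not part of the original:

# Reference-based: compare the output against a known expected answer.
def exact_match(output: str, expected: str) -> bool:
    return output.strip().lower() == expected.strip().lower()

# Reference-free: check a property of the output alone, no gold answer needed.
def is_concise(output: str, max_words: int = 50) -> bool:
    return len(output.split()) <= max_words

# Think in datasets, not unit tests: run every check over a whole eval set.
eval_set = [
    {"question": "What is 2 + 2?", "expected": "4", "output": "4"},
    {"question": "Capital of France?", "expected": "Paris", "output": "Paris"},
]

results = [
    {
        "question": row["question"],
        "exact_match": exact_match(row["output"], row["expected"]),
        "is_concise": is_concise(row["output"]),
    }
    for row in eval_set
]
print(results)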

The reason an LLM judge gives alongside its label helps teams know what to fix.

Single most important point

We avoid numeric scores and use categorical labels instead. What does a 7 actually mean? Different LLM judges can apply different scoring systems, so numbers are hard to interpret and compare.

Do not use numerical scores.

This research showed that LLM judges tend to give a 1 or a 10 rather than using the scale linearly.

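Putting this together, here is a minimal sketch of an LLM-as-a-judge call that asks for a categorical label plus a reason rather than a numeric score. The label set, prompt wording, and JSON output format are assumptions for illustration, not a prescribed scheme:

import json
import os

import requests

JUDGE_LABELS = ["correct", "partially_correct", "incorrect"]  # categorical, not numeric

def judge(question: str, answer: str) -> dict:
    # Ask an LLM judge for a categorical label plus the reason behind it.
    judge_prompt = (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f'Respond with JSON only: {{"label": one of {JUDGE_LABELS}, "reason": "one short sentence"}}'
    )
    response = requests.post(
        url="https://api.openai.com/v1/chat/completions",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
        data=json.dumps({
            "model": "gpt-3.5-turbo",
            "messages": [{"role": "user", "content": judge_prompt}],
            "temperature": 0,
        }),
    ).json()
    # Assumes the judge complied and returned valid JSON; add validation/retries in practice.
    return json.loads(response["choices"][0]["message"]["content"])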

LLM evals are part of a bigger testing programme.

(Figure: chat-router)

Not all evals are LLM-based: traditional code-based evals are also useful and in some ways better, because they are deterministic, cheap, and fast.
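
For instance, here is a sketch of two deterministic, code-based checks (valid JSON and a crude email-pattern match); both are illustrative examples rather than a fixed checklist:

import json
import re

def is_valid_json(output: str) -> bool:
    # Deterministic check: the output must parse as JSON.
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def contains_no_email(output: str) -> bool:
    # Deterministic check: flag anything that looks like an email address.
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) is None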


We can apply a matrix of evaluations within a single component, as sketched below.

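A minimal sketch of such a matrix, assuming a single component (e.g. a chat-router) and two illustrative code-based checks, with room for an LLM-judge column:

# One component, several checks per output:
checks = {
    "non_empty": lambda out: bool(out.strip()),             # code-based
    "under_100_words": lambda out: len(out.split()) < 100,  # code-based
    # an LLM judge label could be added as another column (see the judge sketch above)
}

def evaluate_component(outputs: list[str]) -> list[dict]:
    # Results matrix: one row per output, one column per check.
    return [{name: check(out) for name, check in checks.items()} for out in outputs]

print(evaluate_component(["Route this message to billing.", ""]))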