Patterns for building LLM-based systems and products

Evals

  • Measure how well the system/product is doing
  • Detect regressions

Benchmarks

  • MMLU
    • Set of 57 tasks that span elementary math, US history, computer science, law, etc.
    • Models must demonstrate extensive world knowledge and problem-solving abilities
  • EleutherAI Eval
    • Unified framework to test models
    • Uses zero/few-shot settings on 200 tasks
    • Incorporates a large number of evals
  • HELM
    • Comprehensive assessment across many domains
    • Metrics include accuracy, calibration, robustness, fairness, bias, toxicity, etc.
  • AlpacaEval
    • Automated evaluation framework
    • Measures how often a strong LLM prefers the output of one model over a reference model
    • Metrics include win rate, bias, latency, price, variance, etc.
    • Validated to have a high agreement with 20k human annotations

Metrics

  • Come in two categories
    • Context-dependent
      • Take context into account
      • Often proposed for a specific task
        • Using them for other tasks requires tweaking
    • Context-free
      • Not tied to context when evaluating outputs
      • Only compare output with provided gold references
      • Task agnostic
        • Easier to use with a wide variety of tasks
  • BLEU
    • Context-free metric: modified n-gram precision against the gold reference, with a brevity penalty (see the sketch below)
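
As a concrete example of a context-free metric, here is a minimal sketch of scoring outputs against gold references with corpus-level BLEU via the sacrebleu package; the hypothesis and reference strings are made-up placeholders.

```python
# Corpus-level BLEU with sacrebleu; all strings are illustrative placeholders.
import sacrebleu

outputs = [
    "The cat sat on the mat.",
    "Paris is the capital of France.",
]
# One reference per output; sacrebleu also accepts extra reference sets
# for multi-reference BLEU.
references = [[
    "The cat sat on the mat.",
    "The capital of France is Paris.",
]]

bleu = sacrebleu.corpus_bleu(outputs, references)
print(f"BLEU: {bleu.score:.1f}")  # score is on a 0-100 scale
```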

Retrieval Augmented Generation

  • RAG
  • Fetches relevant data from outside the model and augments the input with it
  • Provides richer context to improve output
  • Helps reduce hallucinations by grounding the model with context
  • Cheaper than continuously pre-training an LLM
  • Easier to remove biased or toxic documents from the retrieval index than from model weights (see the sketch below)
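
A minimal sketch of the retrieve-then-generate flow described above. `embed_fn` and `llm_generate` are hypothetical stand-ins for an embedding model and an LLM call; a real system would use a proper vector store.

```python
# Retrieve top-k documents by cosine similarity, then prepend them to the
# prompt so the generation step is grounded in the retrieved context.
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray,
             docs: list[str], k: int = 3) -> list[str]:
    # Cosine similarity between the query and every document embedding.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    top = np.argsort(-sims)[:k]
    return [docs[i] for i in top]

def rag_answer(question: str, docs: list[str], embed_fn, llm_generate) -> str:
    doc_vecs = np.stack([embed_fn(d) for d in docs])
    context = retrieve(embed_fn(question), doc_vecs, docs)
    prompt = "Answer using only the context below.\n\n"
    prompt += "\n".join(f"- {c}" for c in context)
    prompt += f"\n\nQuestion: {question}"
    return llm_generate(prompt)  # the grounded generation step
```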

Fine-tuning

  • Process of taking a pre-trained model and refining it for a specific task
  • Can refer to several concepts
    • Continued pre-training
      • Apply the same pre-training regime to the base model, but with domain-specific data
    • Instruction fine-tuning
      • Pre-trained model is fine-tuned on instruction-output pairs (see the formatting sketch after this list)
      • Model is made to follow instructions
    • Single-task fine-tuning
      • Model is honed for a narrow/specific task
      • Similar to how BERT and T5 are fine-tuned for a single downstream task
    • Reinforcement learning with human feedback (RLHF)
      • Combines instruction fine-tuning with reinforcement learning
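
To make instruction fine-tuning concrete, below is a minimal sketch of turning instruction-output pairs into training text. The Alpaca-style template is one common convention, not a fixed standard; real pipelines also handle tokenization and masking the loss on the instruction portion.

```python
# Format instruction-output pairs into training prompts.
# The template and example data are illustrative.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

def format_example(pair: dict[str, str]) -> str:
    return TEMPLATE.format(**pair)

dataset = [
    {"instruction": "Summarize: The cat sat on the mat.",
     "output": "A cat sat on a mat."},
]
print(format_example(dataset[0]))
```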

Why fine-tune?

  • Performance and control
    • Improve performance of off-the-shelf models
    • Greater control over LLM behaviour
  • Modularization
    • Enables using multiple models that are good at different things
  • Reduced dependencies
    • Less legal risk when you own the model rather than relying on external APIs
    • Can get around third-party issues like rate-limiting, high costs, or restrictive filters

Fine-tuning techniques

  • Soft prompt tuning
    • Prepends a trainable tensor to the model's input embeddings
  • Prefix tuning
    • Prepends trainable parameters to the hidden state of all transformer blocks
  • Adapter technique
    • Inserts small fully connected (adapter) layers at two points in each transformer block
  • Low-rank Adaptation (LoRA)
    • Freezes the pre-trained weights and injects trainable low-rank matrices into each transformer layer (see the sketch after this list)
  • QLoRA
    • Applies LoRA on top of a base model quantized to 4 bits, reducing memory use further
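
Below is a from-scratch PyTorch sketch of the LoRA idea: the frozen weight matrix is augmented with a trainable low-rank update scaled by alpha/r. It is a toy illustration; in practice one would use a library such as peft.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))  # only A and B receive gradients
```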

How to apply?

  1. Collect data/labels
  2. Define evaluation metrics
  3. Select a pre-trained model
    • Falcon-40B is known to perform well but is unwieldy to fine-tune
    • Falcon-7B is a smaller alternative that is easier and cheaper to fine-tune
  4. Update the model architecture
  5. Pick a fine-tuning approach
    • LoRA or QLoRA are good places to start (see the sketch below)
  6. Hyperparameter tuning
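
As an illustration of steps 3-5, here is a sketch of loading a 4-bit quantized Falcon-7B and attaching LoRA adapters with the peft and bitsandbytes libraries; the hyperparameters are illustrative starting points, and the target module name `query_key_value` assumes Falcon's fused attention layout.

```python
# QLoRA-style setup: 4-bit quantized base model plus LoRA adapters via peft.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # quantize base weights to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # Falcon's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # tiny fraction of total parameters
```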

Caching

  • Store data previously retrieved or computed so it can be served again
  • A popular approach is to cache the LLM response, keyed on an embedding of the input request
    • If a new request is semantically similar enough, serve the cached response (see the sketch below)
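
A minimal sketch of such a semantic cache: responses are keyed on request embeddings, and a new request reuses a cached response when its cosine similarity to a cached key exceeds a threshold. `embed_fn` and `llm_generate` are hypothetical stand-ins, and the threshold value is illustrative.

```python
import numpy as np

class SemanticCache:
    """Cache LLM responses keyed on embeddings of the input requests."""

    def __init__(self, embed_fn, threshold: float = 0.95):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.keys: list[np.ndarray] = []
        self.values: list[str] = []

    def get(self, request: str) -> str | None:
        if not self.keys:
            return None
        q = self.embed_fn(request)
        keys = np.stack(self.keys)
        # Cosine similarity between the new request and every cached key.
        sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q) + 1e-8)
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, request: str, response: str) -> None:
        self.keys.append(self.embed_fn(request))
        self.values.append(response)

def cached_generate(request: str, cache: SemanticCache, llm_generate) -> str:
    if (hit := cache.get(request)) is not None:
        return hit  # cache hit: skip the expensive LLM call
    response = llm_generate(request)
    cache.put(request, response)
    return response
```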


Defensive UX