Patterns for building LLM-based systems and products

Evals

  • Measure how well the system/product is doing
  • Detect regressions

Benchmarks

  • MMLU
    • Set of 57 tasks that span elementary math, US history, computer science, law, etc.
    • Models must demonstrate extensive world knowledge and problem-solving abilities
  • EleutherAI Eval
    • Unified framework to test models
    • Uses zero/few-shot settings on 200 tasks
    • Incorporates a large number of evals
  • HELM
    • Comprehensive assessment across many domains
    • Metrics include accuracy, calibration, robustness, fairness, bias, toxicity, etc.
  • AlpacaEval
    • Automated evaluation framework
    • Measures how often a strong LLM prefers the output of one model over a reference model
    • Metrics include win rate, bias, latency, price, variance, etc.
    • Validated to have a high agreement with 20k human annotations

Metrics

  • Come in two categories
    • Context-dependent
      • Take context into account
      • Often proposed for a specific task
        • Using them for other tasks requires tweaking
    • Context-free
      • Not tied to context when evaluating outputs
      • Only compare output with provided gold references
      • Task agnostic
        • Easier to use with a wide variety of tasks
  • BLEU
    • Context-free metric: modified n-gram precision against the gold reference, with a brevity penalty (see the sketch below)
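
As a concrete example of a context-free metric, here is a minimal sketch of scoring outputs against gold references with corpus-level BLEU via the sacrebleu package; the hypothesis and reference strings are made-up placeholders.

```python
# Corpus-level BLEU with sacrebleu; all strings are illustrative placeholders.
import sacrebleu

outputs = [
    "The cat sat on the mat.",
    "Paris is the capital of France.",
]
# One reference per output; sacrebleu also accepts extra reference sets
# for multi-reference BLEU.
references = [[
    "The cat sat on the mat.",
    "The capital of France is Paris.",
]]

bleu = sacrebleu.corpus_bleu(outputs, references)
print(f"BLEU: {bleu.score:.1f}")  # score is on a 0-100 scale
```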

Retrieval Augmented Generation

  • RAG
  • Fetches relevant data from outside the model and augments the input with it
  • Provides richer context to improve output
  • Helps reduce hallucinations by grounding the model with context
  • Cheaper than continuously pre-training an LLM
  • Easier to remove biased or toxic documents from the retrieval index than from model weights (see the sketch below)
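
A minimal sketch of the retrieve-then-generate flow described above. `embed_fn` and `llm_generate` are hypothetical stand-ins for an embedding model and an LLM call; a real system would use a proper vector store.

```python
# Retrieve top-k documents by cosine similarity, then prepend them to the
# prompt so the generation step is grounded in the retrieved context.
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray,
             docs: list[str], k: int = 3) -> list[str]:
    # Cosine similarity between the query and every document embedding.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    top = np.argsort(-sims)[:k]
    return [docs[i] for i in top]

def rag_answer(question: str, docs: list[str], embed_fn, llm_generate) -> str:
    doc_vecs = np.stack([embed_fn(d) for d in docs])
    context = retrieve(embed_fn(question), doc_vecs, docs)
    prompt = "Answer using only the context below.\n\n"
    prompt += "\n".join(f"- {c}" for c in context)
    prompt += f"\n\nQuestion: {question}"
    return llm_generate(prompt)  # the grounded generation step
```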

Fine-tuning

  • Process of taking a pre-trained model and refining it for a specific task
  • Can refer to several concepts
    • Continued pre-training
      • Apply the same pre-training regime to the base model, but with domain-specific data
    • Instruction fine-tuning
      • Pre-trained model is fine-tuned on instruction-output pairs (see the formatting sketch after this list)
      • Model is made to follow instructions
    • Single-task fine-tuning
      • Model is honed for a narrow/specific task
      • Similar to how BERT and T5 are fine-tuned for a single downstream task
    • Reinforcement learning with human feedback (RLHF)
      • Combines instruction fine-tuning with reinforcement learning
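
To make instruction fine-tuning concrete, below is a minimal sketch of turning instruction-output pairs into training text. The Alpaca-style template is one common convention, not a fixed standard; real pipelines also handle tokenization and masking the loss on the instruction portion.

```python
# Format instruction-output pairs into training prompts.
# The template and example data are illustrative.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

def format_example(pair: dict[str, str]) -> str:
    return TEMPLATE.format(**pair)

dataset = [
    {"instruction": "Summarize: The cat sat on the mat.",
     "output": "A cat sat on a mat."},
]
print(format_example(dataset[0]))
```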

Why fine-tune?

  • Performance and control
    • Improve performance of off-the-shelf models
    • Greater control over LLM behaviour
  • Modularization
    • Enables using multiple models that are good at different things
  • Reduced dependencies
    • Less legal risk when you own the model rather than relying on external APIs
    • Can get around third-party issues like rate-limiting, high costs, or restrictive filters

Fine-tuning techniques

  • Soft prompt tuning
    • Prepends a trainable tensor to the model's input embeddings
  • Prefix tuning
    • Prepends trainable parameters to the hidden state of all transformer blocks
  • Adapter technique
    • Inserts small fully connected (adapter) layers at two points in each transformer block
  • Low-rank Adaptation (LoRA)
    • Freezes the pre-trained weights and injects trainable low-rank matrices into each transformer layer (see the sketch after this list)
  • QLoRA
    • Applies LoRA on top of a base model quantized to 4 bits, reducing memory use further
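
Below is a from-scratch PyTorch sketch of the LoRA idea: the frozen weight matrix is augmented with a trainable low-rank update scaled by alpha/r. It is a toy illustration; in practice one would use a library such as peft.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))  # only A and B receive gradients
```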

How to apply?

  1. Collect data/labels
  2. Define evaluation metrics
  3. Select a pre-trained model
    • Falcon-40B is known to perform well but is unwieldy to fine-tune
    • Falcon-7B is a smaller alternative that is easier and cheaper to fine-tune
  4. Update the model architecture
  5. Pick a fine-tuning approach
    • LoRA or QLoRA are good places to start (see the sketch below)
  6. Hyperparameter tuning
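
As an illustration of steps 3-5, here is a sketch of loading a 4-bit quantized Falcon-7B and attaching LoRA adapters with the peft and bitsandbytes libraries; the hyperparameters are illustrative starting points, and the target module name `query_key_value` assumes Falcon's fused attention layout.

```python
# QLoRA-style setup: 4-bit quantized base model plus LoRA adapters via peft.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # quantize base weights to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # Falcon's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # tiny fraction of total parameters
```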

Caching

  • Store data previously retrieved or computed so it can be served again
  • A popular approach is to cache the LLM response, keyed on an embedding of the input request
    • If a new request is semantically similar enough, serve the cached response (see the sketch below)
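
A minimal sketch of such a semantic cache: responses are keyed on request embeddings, and a new request reuses a cached response when its cosine similarity to a cached key exceeds a threshold. `embed_fn` and `llm_generate` are hypothetical stand-ins, and the threshold value is illustrative.

```python
import numpy as np

class SemanticCache:
    """Cache LLM responses keyed on embeddings of the input requests."""

    def __init__(self, embed_fn, threshold: float = 0.95):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.keys: list[np.ndarray] = []
        self.values: list[str] = []

    def get(self, request: str) -> str | None:
        if not self.keys:
            return None
        q = self.embed_fn(request)
        keys = np.stack(self.keys)
        # Cosine similarity between the new request and every cached key.
        sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q) + 1e-8)
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, request: str, response: str) -> None:
        self.keys.append(self.embed_fn(request))
        self.values.append(response)

def cached_generate(request: str, cache: SemanticCache, llm_generate) -> str:
    if (hit := cache.get(request)) is not None:
        return hit  # cache hit: skip the expensive LLM call
    response = llm_generate(request)
    cache.put(request, response)
    return response
```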


Defensive UX