Anthony Morris AI
AI Engineer World's Fair
Day One
Day Two
Measuring the ROI of AI
Beyang Liu from Sourcegraph
Coding without AI
Should be straightforward
Gaps appear
Bridging gaps takes time and focus
AI helps bridge gaps experienced when coding
Allows us to get to the finish line faster
Everyone is trying to understand the ROI of AI
Measuring the ROI is MP-hard
MP --> "measure productivity"
If you can measure human productivity then you can measure AI ROI
Frameworks for evaluating AI ROI
Roles eliminated
How many roles can you eliminate as you increase productivity?
Hasn't been observed by Sourcegraph in the software industry
A/B velocity
How it works
Measure two groups with/without a tool
Which group was more productive?
Criticisms
High cost and effort
No two teams are the same
Time saved as a function of engagement
How it works
Identify key product actions
Estimate time saved per action (lower bound)
Measure product actions
Criticisms
No second order effects measured
Hard to map to large company initiatives
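The "time saved as a function of engagement" framework can be sketched as a small calculation: count key product actions, multiply each by a conservative per-action estimate, and sum. The action names and per-action minutes below are invented for illustration, not Sourcegraph's figures.

```python
# Lower-bound minutes saved per key product action (illustrative assumptions)
TIME_SAVED_MINUTES = {
    "completion_accepted": 0.5,
    "chat_answer_used": 2.0,
    "unit_test_generated": 5.0,
}

def estimate_time_saved(action_counts: dict) -> float:
    """Return a lower-bound estimate of minutes saved in one period."""
    return sum(
        count * TIME_SAVED_MINUTES.get(action, 0.0)
        for action, count in action_counts.items()
    )

weekly_counts = {
    "completion_accepted": 400,
    "chat_answer_used": 30,
    "unit_test_generated": 10,
}
# 400*0.5 + 30*2.0 + 10*5.0 = 310.0 minutes
print(estimate_time_saved(weekly_counts))
```

Using lower-bound estimates keeps the number defensible, but note the criticism above: this misses second-order effects entirely.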
Impact on KPIs
How it works
Track KPIs
Tie KPIs to dollar value or business impact
Criticisms
LoC is a bad metric
DORA is ops efficiency, not dev productivity
Hard to map KPIs to $ saved
Impact on key initiatives
How it works
Identify how a tool can help with a key initiative
Prove that the tool accelerates the initiative
Criticisms
"Move faster" isn't always an OKR
Survey
How it works
Criticisms
No direct connection to ROI
Be skeptical of claims that P=NP
It's hard to perfectly measure ROI
Define clear success criteria
Dev tools are cheap but dev time is often misvalued
Should we continue measuring engineers the same way in the world of AI automation?
Mythical Man Month
100x lever for small productive teams (as few people as possible)
Productionizing GenAI Models
Lukas Biewald from Weights and Biases
"How many people have LLMs in production?"
"Wow, more than I've ever seen. Why'd y'all come to this talk?"
"Because it's not going well..."
Democratization of AI is here
Every software engineering team is likely to add AI capabilities
AI: easy to demo, hard to productionize
AI somehow makes CEOs stupid
The bar for quality seems to have gone down, even though it was already low for software
Software development is linear while AI development is experimental
Fundamentally different workflows
Software development --> code is IP
AI development --> learning is your IP
How do you save your lessons?
Reproducibility matters
Lessons for building LLM-powered apps
Build an eval framework
Start with lightweight prototypes
Incorporate end-user feedback
Iterate
Eval best practices
Use multiple eval sets/techniques
Make sure metrics correlate with the UX
Do it
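A minimal sketch of the "build an eval framework" lesson: run the model over a small eval set and score it with more than one technique (exact match plus token overlap). The model stub, data, and metric names here are invented for illustration.

```python
def exact_match(pred: str, gold: str) -> float:
    """Strict metric: 1.0 only if the strings match (case-insensitive)."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def token_f1(pred: str, gold: str) -> float:
    """Softer metric: F1 over the sets of tokens."""
    p, g = pred.lower().split(), gold.lower().split()
    common = len(set(p) & set(g))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def run_eval(model, eval_set, metrics):
    """Average each metric over the eval set."""
    scores = {name: 0.0 for name in metrics}
    for example in eval_set:
        pred = model(example["input"])
        for name, fn in metrics.items():
            scores[name] += fn(pred, example["expected"])
    return {name: total / len(eval_set) for name, total in scores.items()}

# Stub "model" so the harness runs without an API call
fake_model = lambda prompt: "Paris" if "France" in prompt else "unknown"
eval_set = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Capital of Peru?", "expected": "Lima"},
]
print(run_eval(fake_model, eval_set, {"exact": exact_match, "f1": token_f1}))
```

Multiple metrics matter because a single number can hide regressions; the real check is whether the metrics track the UX users actually experience.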
Building production RAG systems at scale (with 10s of millions users)
Nikhil Thota from Perplexity
Ends up being systems engineering problems
Backend perf
Distributed systems
Model sizes
UX
Progressive output
Users will wait if they think it'll be worth it
Eg. show all the work that Perplexity is doing
Perceived latency
Streaming to make it feel like stuff is happening
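The perceived-latency point can be demonstrated with a toy streaming generator: the first token reaches the user in a fraction of the total generation time. Timings are simulated here, not Perplexity's internals.

```python
import time

def generate_tokens(answer: str, per_token_delay: float = 0.01):
    """Simulate an LLM streaming one token at a time."""
    for token in answer.split():
        time.sleep(per_token_delay)
        yield token

start = time.monotonic()
stream = generate_tokens("Streaming makes progress visible immediately")
first_token = next(stream)                  # user sees output here
first_token_latency = time.monotonic() - start
rest = list(stream)                         # full answer finishes later
total_latency = time.monotonic() - start

# The first token arrives well before the full answer is done.
print(first_token, first_token_latency < total_latency)
```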
Evaluating quality
Big investment in tooling and human effort
Eval intermediate steps and thought processes
Errors compound in the system
"Scratchpad"
Perplexity uses Chain of Thought so the model explains itself
Can look into and debug the CoT
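An illustrative version of the scratchpad pattern: have the model emit its chain of thought between delimiters, then split the reasoning from the final answer so intermediate steps can be logged and debugged. The tag format and parsing are assumptions, not Perplexity's implementation.

```python
def split_scratchpad(raw_output: str):
    """Separate reasoning (inside <scratchpad> tags) from the answer."""
    reasoning = ""
    answer = raw_output
    if "<scratchpad>" in raw_output and "</scratchpad>" in raw_output:
        before, rest = raw_output.split("<scratchpad>", 1)
        reasoning, after = rest.split("</scratchpad>", 1)
        answer = (before + after).strip()
    return reasoning.strip(), answer

raw = (
    "<scratchpad>Query asks for a sum. 2 + 2 = 4.</scratchpad>\n"
    "The answer is 4."
)
reasoning, answer = split_scratchpad(raw)
print(reasoning)  # loggable for debugging intermediate steps
print(answer)     # shown to the user
```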
Handle non-determinism and edge cases
LLM as a Judge scales better but is not nearly as accurate
Still rely a lot on human annotators
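The LLM-as-a-Judge pattern can be sketched as a grading prompt plus a score parser. `call_llm` is a placeholder for a real model call, and the rubric wording is invented for illustration.

```python
import re

JUDGE_PROMPT = """You are grading an answer for factual accuracy.
Question: {question}
Answer: {answer}
Reply with a single line: SCORE: <1-5>"""

def parse_score(judge_reply: str):
    """Extract a 1-5 score from the judge's reply, or None on failure."""
    match = re.search(r"SCORE:\s*([1-5])", judge_reply)
    return int(match.group(1)) if match else None

def judge(question: str, answer: str, call_llm):
    prompt = JUDGE_PROMPT.format(question=question, answer=answer)
    return parse_score(call_llm(prompt))

# Stubbed judge model so the sketch runs end to end
fake_judge_llm = lambda prompt: "SCORE: 4"
print(judge("Capital of France?", "Paris", fake_judge_llm))
```

The parser returning `None` on malformed replies is one concrete reason judges still need human annotators to spot-check.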
Orchestration
Coordinating between all the systems is complex
Build to be model agnostic
Everything is built in-house
Use keyword search + embeddings
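A toy sketch of the keyword + embeddings combination: a weighted sum of a keyword-overlap score and an embedding similarity. The "embeddings" here are hand-rolled bag-of-words vectors purely for illustration; a real system would use learned embeddings and a proper keyword index.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms appearing in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_search(query: str, docs: list, alpha: float = 0.5) -> list:
    """Rank docs by a weighted sum of keyword and embedding scores."""
    scored = [
        (alpha * keyword_score(query, d) + (1 - alpha) * cosine(embed(query), embed(d)), d)
        for d in docs
    ]
    return [d for _, d in sorted(scored, reverse=True)]

docs = ["rust borrow checker rules", "python garbage collection", "rust memory safety"]
print(hybrid_search("rust memory", docs)[0])  # "rust memory safety"
```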
Collections
Allows you to make groupings for multiple questions
Can define a prompt for any question in the collection
GitHub Next Explorations
Rahul Pandita from GitHub
"AI is the new electricity"
Andrew Ng believes AI will transform every industry just like electricity did many years ago
GitHub Next's charter is to explore the future of engineering
Copilot Next Editor Suggestion
Suggest a change somewhere your cursor isn't
Eg. if you change a parameter in a function call, suggest updating the documentation
GitHub Workspace
"We shall not cease from exploration, and the end of all our exploring will be to arrive where we started and know the place for the first time." ~ T.S. Eliot
Cursor: Building the Human-AI Hybrid Engineer
Model is trained to predict next edit (Copilot++)
Hard problems being solved
Sub-300ms next-action model
Code-specific speculative inference
Merkle tree file sync server
Context building engine
Scalable doc scraping infra
Model caching
Auto-updating framework
Remote perf profiles
etc, etc
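The Merkle tree file sync idea can be sketched briefly: client and server compare root hashes, and only subtrees whose hashes differ need to be walked, so a one-file change costs O(log n) comparisons rather than rehashing every file. The structure below is illustrative, not Cursor's implementation.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaf_hashes: list) -> str:
    """Fold leaf hashes pairwise until one root hash remains."""
    level = leaf_hashes
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate last leaf when odd
        level = [
            h((level[i] + level[i + 1]).encode())
            for i in range(0, len(level), 2)
        ]
    return level[0]

files_v1 = [b"main.rs", b"lib.rs", b"sync.rs", b"cache.rs"]
files_v2 = [b"main.rs", b"lib.rs", b"sync.rs EDITED", b"cache.rs"]

root_v1 = merkle_root([h(f) for f in files_v1])
root_v2 = merkle_root([h(f) for f in files_v2])
print(root_v1 != root_v2)  # roots differ, so some subtree changed
```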
The AI emperor has no DAUs: why most devs still don't use code AI
Quinn Slack from Sourcegraph
Best case estimate: ~5% of professional developers use AI to code
Why devs don't use AI?
It isn't perfect
Don't need it
Company doesn't use it
Too expensive
Security/privacy concerns
Usage needs to grow a lot
Lessons from building Cody
Hype fools you and your customers
You have to be a DAU yourself
Don't always listen to the customers, understand what they really want/need
If it's so good, why not exploit it vs sell it?
Tweets do not transfer to DAU
AI code completion is too perfect as a feature
Most other code AI features don't come close
The Future of Knowledge Assistants
Jerry Liu from LlamaIndex
A better knowledge assistant
Advanced data and retrieval modules
Good RAG begins with parsers
Good parsing can improve performance
Helps reduce hallucinations
Advanced single-agent query flows
Ingredients
Routing
Tool use
One-shot planning
Memory
Agentic RAG
Interact with data services as tools, not just a simple query --> vector DB
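The routing ingredient can be sketched as follows: a router (an LLM in practice, stubbed with a heuristic here) picks one tool per query, and each data service is exposed as a tool rather than a single query-to-vector-DB call. Tool names and the routing rules are invented for illustration.

```python
# Each data service exposed as a tool (stubbed responses)
TOOLS = {
    "vector_search": lambda q: f"[semantic results for: {q}]",
    "sql_query":     lambda q: f"[rows matching: {q}]",
    "web_search":    lambda q: f"[live pages about: {q}]",
}

def route(query: str) -> str:
    """Stand-in for an LLM router: pick a tool from the query's shape."""
    lowered = query.lower()
    if any(w in lowered for w in ("count", "average", "total")):
        return "sql_query"
    if "latest" in lowered or "today" in lowered:
        return "web_search"
    return "vector_search"

def answer(query: str) -> str:
    tool = route(query)
    return TOOLS[tool](query)

print(answer("total revenue by region"))
print(answer("what does the design doc say about caching?"))
```

This also hints at the failure mode noted below: the larger `TOOLS` grows, the harder the routing decision becomes.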
Remaining gaps
Specialist agents tend to be better at tasks
Agents fail when there are too many tools available
Interacting with other agents isn't great
General multi-agent task solvers
Why?
Specialization
Parallelization
Cost/latency
Challenges
Defining service architecture
Constrained vs unconstrained flows
Llama Agents
Agents as microservices
Service oriented, distributed architecture
Communicate via message queue
Encapsulation and modularity
Communication via standardized APIs
Easy to deploy
Scalability and resource management
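The agents-as-microservices idea can be sketched with a stdlib queue standing in for a real message broker: an orchestrator publishes tasks, and an agent service consumes them and publishes results. The message shape and agent name are assumptions, not the Llama Agents API.

```python
import queue

task_queue = queue.Queue()
result_queue = queue.Queue()

def summarizer_agent():
    """Consume one task from the queue and publish a result message."""
    task = task_queue.get()
    summary = task["text"][:20] + "..."  # stand-in for real summarization
    result_queue.put({"task_id": task["task_id"], "summary": summary})

# An orchestrator publishes a task; the agent service picks it up.
task_queue.put({"task_id": 1, "text": "Agents talk via standardized message formats."})
summarizer_agent()
print(result_queue.get())
```

Because the only coupling is the message format, each agent can be deployed, scaled, and swapped independently.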
LLM apps are only as good as your data