Writing

Blog

Technical writing by Ashwani Jha on distributed systems and AI engineering: production LLM infrastructure, agent observability, RAG, and system design.

What Every Backend Engineer Should Know About Attention

RNNs forced you to wait for token 100 before processing token 101. Transformers parallelize the whole sequence. Here's why that matters for production systems.

Teaching LLMs to reach for a calculator

A 6.7B model that knows when to call a calculator beats GPT-3 175B on math benchmarks. Toolformer's self-supervised approach to tool use is worth understanding before you hardcode a tool-calling chain.

Self-correcting agents in production

The labeling bottleneck is real. Constitutional AI teaches agents to critique themselves using principles — fewer human labels, faster iteration, new tradeoffs.

LoRA Fine-Tuning: What the Microsoft Paper Actually Says

LoRA cuts trainable parameters by up to 10,000x on GPT-3 175B while matching full fine-tuning quality. Here's what Microsoft's paper actually says about low-rank adaptation.

Scaling laws are not just about research budgets

Loss follows a power law across seven orders of magnitude of compute. Kaplan et al.'s scaling laws are a decision framework — not just research trivia.

RAG in Production: Warnings from the Original Paper

RAG is in every AI pitch deck. Most skip the paper's failure modes: retrieval collapse, frozen encoders, approximate MIPS. Lewis et al. built something subtler.

MapReduce: Google's 2004 Paper and Your 2026 Decisions

Dean & Ghemawat's 2004 MapReduce decisions — stragglers, data locality, combiner functions — are the ones you still make in Spark and Flink today.

How LLMs learn to use tools

Rule-based tool routing is brittle. Supervised annotation is expensive. Toolformer shows a third path — let the model decide where tools help, filter on loss, and fine-tune. The numbers are worth understanding before you build your next agent.

How I think about agent memory

Most LLM agents are amnesiacs. The fix isn't a bigger context window — it's a memory system with four layers, explainable retrieval, and a feedback loop.

In-Context Learning Is Not Magic: What GPT-3 Actually Shows

The GPT-3 paper is cited constantly and read rarely. It documents failure modes, a data contamination bug, and benchmark gaps that matter in production.

Dynamo's Tradeoffs: What Amazon's 2007 Paper Still Teaches

Amazon's 2007 Dynamo paper defined the tradeoffs every distributed storage system still makes: eventual consistency, conflict resolution, and availability over strong consistency.

Constitutional AI: What Anthropic's Paper Actually Says

RLHF for harmlessness requires labeling harmful outputs at scale. CAI replaces that with a model critiquing itself against a written constitution.

BERT and the Fine-Tuning Paradigm: What the Paper Built

The 2018 BERT paper defined the fine-tuning paradigm behind every embedding model and text classifier you use today. Here's what Devlin et al. actually built.

How I think about agent observability

Traditional APM was built for web requests, not agents that loop, retry, branch, and spend dollars per call. Here's what agent observability actually needs.

The AI agent observability stack

Agent observability is five different problems. Different tools solve different ones — traces, hallucinations, cost, drift. Here's a map and where to start.

Building a Brain: Cognitive Architecture for AI

Most AI assistants are stateless, waking up blank every session. I built Friday a brain — four-layer memory, BDI runtime, and adaptive learning.