Topics

Browse posts by category and tag — every topic we cover, with the latest pieces under each.

Categories

ops 8 posts

Semantic Caching for LLM Serving: When the Cache Hit Is Not a String Match

Exact-match caching misses most LLM cache hits — paraphrases tank hit rate. Semantic caching, threshold tuning, and the production failure modes that bite.

LLM Eval Pipelines in CI/CD: Gates That Actually Catch Things

Running LLM evals in CI is easy to set up and easy to get wrong. How to build quality gates and red-team gates that block bad prompts before they ship —

Prompt Versioning and Deployment: The Operational Workflow

Versioning prompts is the easy part. The operational hard parts — decoupling prompt releases from code deploys, labels for staging vs production

RAG Observability: Monitoring the Retrieval Layer in Production

When a RAG system gives a bad answer, the retrieval layer is usually to blame — and your LLM monitoring can't see it.

Self-Hosted vs API LLMs: The Operational Tradeoffs

The self-host-versus-API decision is usually framed as a cost-per-token comparison. The real tradeoffs are operational — GPU memory math, who owns

Guardrails in the Serving Path: Defense in Depth for LLMs

Guardrails are not a single check you bolt on — they're layers in the request path, each catching what the others miss.

mlops 4 posts

infrastructure 2 posts

inference 1 post

Best LLM Serving Frameworks 2026: vLLM, SGLang, TensorRT-LLM, and Ray Serve Compared

How vLLM, SGLang, TensorRT-LLM, and Ray Serve stack up on throughput, TTFT, and operational complexity — and which one fits your workload in 2026.