Topics
Browse posts by category and tag — every topic we cover, with the latest pieces under each.
Tags
- #observability 5
- #production-llm 5
- #llm-ops 4
- #llm-serving 3
- #cost-optimization 2
- #deployment 2
- #infrastructure 2
- #production 2
- #prompt-management 2
- #rag 2
- #self-hosted-llm 2
- #vllm 2
- #arize 1
- #benchmark 1
- #best-practices 1
- #ci-cd 1
- #cost-monitoring 1
- #debugging 1
- #deepeval 1
- #drift 1
- #evidently 1
- #feature-engineering 1
- #governance 1
- #gpu 1
- #guardrails 1
- #inference 1
- #langfuse 1
- #latency 1
- #llm-eval 1
- #llm-inference 1
- #llm-security 1
- #llmops 1
- #mlops 1
- #model-registry 1
- #monitoring 1
- #nemo-guardrails 1
- #prompt-injection 1
- #prompt-versioning 1
- #promptfoo 1
- #ragas 1
- #ray-serve 1
- #retrieval 1
- #retrieval-augmented-generation 1
- #review 1
- #semantic-caching 1
- #sglang 1
- #tensorrt-llm 1
- #testing 1
- #token-tracking 1
- #tooling 1
- #training-serving-skew 1
- #vector-database 1
Categories
ops 8 posts
- Semantic Caching for LLM Serving: When the Cache Hit Is Not a String Match
  Exact-match caching misses most LLM cache hits — paraphrases tank hit rate. Semantic caching, threshold tuning, and the production failure modes that bite.
- LLM Eval Pipelines in CI/CD: Gates That Actually Catch Things
  Running LLM evals in CI is easy to set up and easy to get wrong. How to build quality gates and red-team gates that block bad prompts before they ship…
- Prompt Versioning and Deployment: The Operational Workflow
  Versioning prompts is the easy part. The operational hard parts — decoupling prompt releases from code deploys, labels for staging vs production…
- RAG Observability: Monitoring the Retrieval Layer in Production
  When a RAG system gives a bad answer, the retrieval layer is usually to blame — and your LLM monitoring can't see it.
- Self-Hosted vs API LLMs: The Operational Tradeoffs
  The self-host-versus-API decision is usually framed as a cost-per-token comparison. The real tradeoffs are operational — GPU memory math, who owns…
- Guardrails in the Serving Path: Defense in Depth for LLMs
  Guardrails are not a single check you bolt on — they're layers in the request path, each catching what the others miss.
mlops 4 posts
- Model Registry Patterns That Actually Work
  What the hype skips about model registries, what mature teams actually do, and how to avoid the metadata graveyard most registries become.
- Training/Serving Skew: The Silent Killer
  How training/serving skew happens, why it's so hard to see, and the specific places to look when your model works in eval and breaks in prod.
- MLOps Tool Review: Arize vs Evidently
  An honest comparison of two ML observability tools—where each fits, where each frustrates, and what neither one solves.
- Concept Drift Detection in Production: Practical Thresholds
  How to actually detect concept drift in live systems, what thresholds matter, and why your monitoring dashboard is probably lying to you.
infrastructure 2 posts
- Self-Hosting LLM vs API Cost: A TCO Breakdown for 2026
  A quantitative breakdown of self-hosting LLM vs API cost — hardware, cloud GPU rental, engineering overhead, and the utilization trap that breaks most breakeven models.
- Best Vector Database for RAG: A Practical Comparison (2026)
  Pinecone, Weaviate, Qdrant, pgvector, Chroma, Milvus — benchmarked on recall@k, p99 latency, filtered search, and cost at real production scale.