Callum Macpherson

I'm an AI engineer in London. At Runna, I build the ML systems that make millions of people healthier runners. Before that, I spent time at AWS helping customers ship custom AI, from post-trained LLMs and autonomous agents to computer vision pipelines, across robotics, finance, healthtech, aviation, and more. I've worked with every modality you can throw at a GPU: video, images, audio, text.

I write about the engineering that gets models from notebook to production: retrieval systems, voice agents, evaluation frameworks, and the deployment details nobody warns you about.

Featured

Capacity Planning Qwen3-TTS on an A10G

17 Jul, 2026

A single-GPU capacity study for streaming Qwen3-TTS: SLOs, vLLM-Omni comparison, fleet estimates, and a proposed production architecture.

How to Build a Low-Latency Streaming Qwen3-TTS Server

15 Jul, 2026

Turning Qwen3-TTS into a deployable AWS streaming service with CUDA graphs, FastAPI WebSockets, and measured sub-200ms first-audio latency.

Recent Posts

Advanced Retrieval for Retrieval-Augmented Generation

LLMs Evals: A General Framework for Custom Evaluations

Implementing RAG in LangChain with Chroma: A Step-by-Step Guide

Malicious LLM Prompt Detection in Python