Sep 19, 2025
LLM serving with plain Hugging Face Transformers is slow because the library is not optimized for production inference. External API calls risk data leakage, so the company serves open-source LLMs on in-house GPUs instead. The main bottleneck is GPU memory, especially the KV cache, which grows with the number of input and output tokens (e.g., OPT-13B uses roughly 1.6 GB per request). Traditional serving frameworks use request-level batching, which causes blocking and padding inefficiencies; fine-grained (iteration-level) batching fixes both but still suffers from KV cache memory fragmentation. vLLM introduces PagedAttention, which splits the KV cache into fixed-size blocks to eliminate fragmentation and enable block sharing. It supports basic decoding as well as parallel sampling via copy-on-write. Under FCFS scheduling, evicted KV blocks are restored either by swapping to CPU memory or by recomputation. Overall, LLM serving is memory-bound, and vLLM improves throughput by using the KV cache efficiently rather than by relying on larger GPUs.
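For intuition, here is a minimal Python sketch of the paging idea behind PagedAttention: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks. All names here (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are illustrative, not vLLM's actual API.

```python
# Illustrative sketch of PagedAttention-style KV cache paging (not the vLLM API).

BLOCK_SIZE = 16  # tokens stored per KV cache block


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared GPU pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("out of KV cache blocks; evict or swap a sequence")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Maps a request's logical token positions to physical KV blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is needed only when the last one is full, so waste is
        # bounded by one partially filled block per sequence (no fragmentation
        # from pre-reserving the maximum sequence length).
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):        # decode 40 tokens
    seq.append_token()
print(seq.block_table)     # 3 physical blocks for 40 tokens (16 tokens per block)
```

The same block-table indirection is what makes copy-on-write cheap for parallel sampling: multiple samples can point at the same physical blocks for their shared prefix and only copy a block when one of them writes to it.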
Sep 15, 2025
This post introduces a custom PGVector client that stores and queries vector data in an RDB-friendly way, unlike LangChain's default client, which uses a NoSQL-like schema.
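To make "RDB-friendly" concrete, here is a rough sketch of what such a layout and query could look like with psycopg2 and the pgvector extension: typed columns per metadata field instead of a single JSON blob. The table name, columns, and connection string are hypothetical, not the client's actual schema.

```python
# Sketch of an RDB-friendly pgvector layout (hypothetical schema, not the post's client).
import psycopg2  # assumes psycopg2 and the pgvector extension are installed

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
cur = conn.cursor()

cur.execute("""
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE TABLE IF NOT EXISTS documents (
        id         BIGSERIAL PRIMARY KEY,
        title      TEXT NOT NULL,
        source     TEXT,
        created_at TIMESTAMPTZ DEFAULT now(),
        embedding  VECTOR(3)        -- tiny dimension just for the example
    );
""")

# Insert: the embedding is passed as a pgvector literal string like '[0.1,0.2,0.3]'.
cur.execute(
    "INSERT INTO documents (title, source, embedding) VALUES (%s, %s, %s::vector)",
    ("hello", "blog", "[0.1,0.2,0.3]"),
)

# Query: cosine-distance nearest neighbours, filterable with ordinary SQL predicates.
cur.execute(
    """
    SELECT id, title
    FROM documents
    WHERE source = %s
    ORDER BY embedding <=> %s::vector
    LIMIT 5
    """,
    ("blog", "[0.1,0.2,0.3]"),
)
print(cur.fetchall())
conn.commit()
```

Because metadata lives in ordinary columns, filters, joins, and indexes work the same way they do for any other relational table, rather than going through JSON operators.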
Sep 9, 2025
The Transformer architecture, introduced in the 2017 paper Attention Is All You Need, revolutionized natural language processing by addressing the limitations of RNNs and LSTMs. Its key innovations, positional encoding and self-attention, enable efficient parallel computation and the ability to capture long-range dependencies in text. These advances made the Transformer the foundation of modern large language models such as GPT and LLaMA, enabling faster training and better performance on NLP tasks.
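As a quick refresher on the core mechanism, here is a toy scaled dot-product self-attention in NumPy; the shapes and random weights are purely illustrative, not taken from any real model.

```python
# Toy single-head scaled dot-product self-attention (illustrative values only).
import numpy as np


def self_attention(x: np.ndarray, w_q: np.ndarray, w_k: np.ndarray, w_v: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model); returns the attention output, one row per token."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                       # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over key positions
    return weights @ v                                    # every token attends to every other token


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (4, 8)
```

Because every token attends to every other token in a single matrix product, the whole sequence is processed in parallel, which is exactly what RNNs and LSTMs cannot do with their step-by-step recurrence.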