Sep 19, 2025
LLM serving with plain Hugging Face Transformers is slow because the library is not optimized for production inference. External API calls risk data leakage, so the company serves open-source LLMs on in-house GPUs instead. The main bottleneck is GPU memory, especially the KV cache, which grows with the number of input and output tokens (e.g., OPT-13B uses roughly 1.6 GB per request). Traditional serving frameworks use request-level batching, which causes blocking and padding inefficiencies; fine-grained (iteration-level) batching fixes both but still suffers from KV cache memory fragmentation. vLLM introduces PagedAttention, which splits the KV cache into fixed-size blocks to eliminate fragmentation and enable block sharing. It supports basic decoding as well as parallel sampling via copy-on-write. Under FCFS scheduling, evicted KV blocks are restored either by swapping to CPU memory or by recomputation. Overall, LLM serving is memory-bound, and vLLM improves throughput by using the KV cache efficiently rather than by relying on larger GPUs.
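For intuition, here is a minimal Python sketch of the paging idea behind PagedAttention: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks. All names here (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are illustrative, not vLLM's actual API.

```python
# Illustrative sketch of PagedAttention-style KV cache paging (not the vLLM API).

BLOCK_SIZE = 16  # tokens stored per KV cache block


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared GPU pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("out of KV cache blocks; evict or swap a sequence")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Maps a request's logical token positions to physical KV blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is needed only when the last one is full, so waste is
        # bounded by one partially filled block per sequence (no fragmentation
        # from pre-reserving the maximum sequence length).
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):        # decode 40 tokens
    seq.append_token()
print(seq.block_table)     # 3 physical blocks for 40 tokens (16 tokens per block)
```

The same block-table indirection is what makes copy-on-write cheap for parallel sampling: multiple samples can point at the same physical blocks for their shared prefix and only copy a block when one of them writes to it.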
Sep 15, 2025
This post introduces a custom PGVector client that stores and queries vector data in an RDB-friendly way, unlike LangChain's default client, which uses a NoSQL-like schema.
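To make "RDB-friendly" concrete, here is a rough sketch of what such a layout and query could look like with psycopg2 and the pgvector extension: typed columns per metadata field instead of a single JSON blob. The table name, columns, and connection string are hypothetical, not the client's actual schema.

```python
# Sketch of an RDB-friendly pgvector layout (hypothetical schema, not the post's client).
import psycopg2  # assumes psycopg2 and the pgvector extension are installed

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
cur = conn.cursor()

cur.execute("""
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE TABLE IF NOT EXISTS documents (
        id         BIGSERIAL PRIMARY KEY,
        title      TEXT NOT NULL,
        source     TEXT,
        created_at TIMESTAMPTZ DEFAULT now(),
        embedding  VECTOR(3)        -- tiny dimension just for the example
    );
""")

# Insert: the embedding is passed as a pgvector literal string like '[0.1,0.2,0.3]'.
cur.execute(
    "INSERT INTO documents (title, source, embedding) VALUES (%s, %s, %s::vector)",
    ("hello", "blog", "[0.1,0.2,0.3]"),
)

# Query: cosine-distance nearest neighbours, filterable with ordinary SQL predicates.
cur.execute(
    """
    SELECT id, title
    FROM documents
    WHERE source = %s
    ORDER BY embedding <=> %s::vector
    LIMIT 5
    """,
    ("blog", "[0.1,0.2,0.3]"),
)
print(cur.fetchall())
conn.commit()
```

Because metadata lives in ordinary columns, filters, joins, and indexes work the same way they do for any other relational table, rather than going through JSON operators.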
Sep 9, 2025
The Transformer architecture, introduced in the 2017 paper Attention Is All You Need, revolutionized natural language processing by addressing the limitations of RNNs and LSTMs. Its key innovations, positional encoding and self-attention, enable efficient parallel computation and the ability to capture long-range dependencies in text. These advances made the Transformer the foundation of modern large language models such as GPT and LLaMA, enabling faster training and better performance on NLP tasks.
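As a quick refresher on the core mechanism, here is a toy scaled dot-product self-attention in NumPy; the shapes and random weights are purely illustrative, not taken from any real model.

```python
# Toy single-head scaled dot-product self-attention (illustrative values only).
import numpy as np


def self_attention(x: np.ndarray, w_q: np.ndarray, w_k: np.ndarray, w_v: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model); returns the attention output, one row per token."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                       # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over key positions
    return weights @ v                                    # every token attends to every other token


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (4, 8)
```

Because every token attends to every other token in a single matrix product, the whole sequence is processed in parallel, which is exactly what RNNs and LSTMs cannot do with their step-by-step recurrence.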