Introduction to RAG: Chunking and Embeddings

March 5, 2025



⚠️ Remember: This post is an introduction to RAG concepts as I understand them. I'm sharing my learning process—I can make mistakes. See the blog's philosophy section for more on this approach.

💼 Open to work opportunities in web development. Let's connect!

🔗 Want to see how this project was built? Check out the project section!

Introduction

Retrieval-Augmented Generation (RAG) has become a fundamental approach in modern AI applications, enabling models to provide more accurate, up-to-date, and contextually relevant responses. Unlike traditional language models that rely solely on their training data, RAG systems can retrieve and incorporate external knowledge when generating responses.

In this introduction, we'll explore the essential components that make RAG work effectively: chunking and embeddings. Understanding these concepts is crucial for anyone looking to implement their own RAG system or optimize an existing one.

Note: This article assumes basic familiarity with language models but aims to make these concepts accessible to technical and semi-technical readers alike.

RAG Fundamentals

What is RAG?

Retrieval-Augmented Generation (RAG) is a hybrid approach that combines the strengths of retrieval-based systems with generative AI. The process involves:

  1. Retrieval: Fetching relevant documents or data from a knowledge base
  2. Augmentation: Passing the retrieved data to a generative model
  3. Generation: Producing a response based on both the retrieved context and the model's capabilities

This approach addresses a key limitation of traditional language models: their inability to access specific information beyond their training data.
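
To make these three steps concrete, here is a minimal sketch in TypeScript. searchChunks and callModel are hypothetical placeholders for a vector search and an LLM call, declared only so the sketch type-checks; they are not from any specific library.

// Minimal sketch of the retrieve → augment → generate flow.
// searchChunks and callModel are hypothetical stand-ins.
declare function searchChunks(query: string, opts: { topK: number }): Promise<string[]>;
declare function callModel(prompt: string): Promise<string>;

async function answerWithRag(question: string): Promise<string> {
  // 1. Retrieval: fetch the chunks most relevant to the question
  const chunks = await searchChunks(question, { topK: 3 });

  // 2. Augmentation: place the retrieved text into the prompt
  const prompt =
    "Answer using only the context below.\n\n" +
    "Context:\n" + chunks.join("\n---\n") + "\n\n" +
    "Question: " + question;

  // 3. Generation: let the model produce the final answer
  return callModel(prompt);
}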

Key Components

A typical RAG pipeline consists of several core elements:

  1. Document Processing: Converting raw documents into a searchable format
  2. Chunking: Breaking documents into smaller, manageable pieces
  3. Embedding Generation: Creating numerical representations of text chunks
  4. Vector Storage: Efficiently storing and indexing these embeddings
  5. Retrieval Mechanism: Finding the most relevant information for a query
  6. Generation: Using retrieved context to produce accurate responses

While each component is important, we'll focus specifically on chunking and embeddings—two crucial elements that significantly impact RAG performance.

Chunking

Chunking is the process of splitting long texts into smaller, more manageable pieces. This might seem like a simple preprocessing step, but it profoundly impacts the effectiveness of your entire RAG system.

Why Chunking Matters

There are several compelling reasons to implement chunking in your RAG pipeline:

  1. Improved Retrieval Precision: Smaller chunks allow for more precise matching of relevant content to specific queries.

  2. Working Within Model Limits: Most embedding models have token limits. Chunking ensures your content fits within these constraints.

  3. Context Window Optimization: LLMs have limited context windows. Smaller, more relevant chunks make better use of this limited space.

  4. Better User Experience: Returning concise, relevant chunks is more useful than overwhelming a user with entire documents.

Without effective chunking, your RAG system might struggle with precision and efficiency, especially as your knowledge base grows.

Chunking Strategies

Several approaches can be used for chunking, each with its own strengths (a code sketch follows the comparison table below):

  1. Fixed-Length Chunking: Splitting by character or word count (e.g., every 500 characters or 200 words)

    • Pros: Simple to implement
    • Cons: May break semantic units
  2. Semantic Boundary Chunking: Splitting by sentences, paragraphs, or sections

    • Pros: Preserves natural language boundaries
    • Cons: More complex to implement
  3. Token-Based Chunking: Using a tokenizer to split by token count

    • Pros: Directly addresses model constraints
    • Cons: Requires understanding of tokenization
  4. Hybrid Approaches: Combining methods (e.g., splitting by paragraphs but ensuring no chunk exceeds a token limit)

    • Pros: Balanced approach
    • Cons: Increased implementation complexity
Chunking Strategy   Best Use Case
Fixed-Length        Simple implementations, homogeneous content
Semantic            Content with clear structure (articles, documentation)
Token-Based         When working with specific LLM API constraints
Hybrid              Production systems requiring balance of precision and performance
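
As a concrete starting point, here is a minimal fixed-length chunker in TypeScript, with a small overlap between consecutive chunks. The function name and the 500/50 defaults are illustrative choices, not recommendations from any particular library.

// Minimal sketch: fixed-length chunking by character count.
function chunkByLength(text: string, size = 500, overlap = 50): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}

// chunkByLength("a".repeat(1200)) → 3 chunks of at most 500 characters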

Real-world Example

Let's consider a practical example to illustrate chunking in action:

Imagine you have a 1,000-word article about climate change. Using a semantic chunking approach, you might divide it into:

Chunk 1: Introduction and problem statement [180 words]
Chunk 2: Historical climate data [220 words]
Chunk 3: Current impacts [250 words]
Chunk 4: Future projections [200 words]
Chunk 5: Potential solutions [150 words]

This approach preserves the logical structure of the article while creating manageable units for embedding and retrieval.
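
A minimal version of this semantic approach, assuming paragraphs are separated by blank lines, might look like the following TypeScript sketch (the 1200-character budget is an arbitrary illustration):

// Minimal sketch: split on blank lines, then merge paragraphs
// until a rough size budget is reached.
function chunkByParagraph(text: string, maxChars = 1200): string[] {
  const paragraphs = text.split(/\n\s*\n/).filter((p) => p.trim());
  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    if (current && current.length + p.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += p + "\n\n";
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}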

Embeddings

Embeddings are numerical representations of text that capture semantic meaning in a way that computers can process. They form the foundation of how RAG systems understand and retrieve information.

What Are Embeddings?

An embedding is essentially a list of numbers (a vector) that represents the meaning of a piece of text. Modern embedding models convert words, phrases, or entire chunks of text into high-dimensional vectors that capture semantic relationships.

For example, in a good embedding space:

  • Similar concepts have similar vector representations
  • Different meanings of the same word are distinguished by context
  • Relationships between concepts are preserved (e.g., "king" - "man" + "woman" ≈ "queen"), as the toy sketch below illustrates
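
The numbers below are toy values invented purely for illustration (real embeddings have hundreds or thousands of dimensions), but they show the kind of vector arithmetic this last property enables:

// Toy 3-dimensional vectors, invented for illustration only.
const king  = [0.9, 0.8, 0.1];
const man   = [0.5, 0.9, 0.2];
const woman = [0.5, 0.1, 0.2];

const add = (a: number[], b: number[]) => a.map((x, i) => x + b[i]);
const sub = (a: number[], b: number[]) => a.map((x, i) => x - b[i]);

// "king" - "man" + "woman" lands near where "queen" would sit
const queenish = add(sub(king, man), woman); // ≈ [0.9, 0.0, 0.1]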

Understanding Embedding Dimension

Embedding models like those from Mistral, OpenAI, or Cohere typically produce vectors with hundreds or thousands of dimensions. For instance, Mistral's embedding model creates 1024-dimensional vectors.

This might sound abstract, so let's break it down with a simpler analogy:

Imagine describing a person using just three numbers:

  • Height: 1.75m
  • Weight: 70kg
  • Age: 25 years

That's a 3-dimensional vector describing a person.

Text embeddings work similarly but use many more dimensions (e.g., 1024) to capture nuanced aspects like:

  • Topic and subject matter
  • Writing style and tone
  • Entity relationships
  • Semantic meaning
  • And many other language features

The power of embeddings comes from how they enable similarity search—finding related content by measuring the "distance" between vectors.

When a user asks a question, the RAG system:

  1. Creates an embedding of the question
  2. Compares it to the embeddings of all chunks in the database
  3. Retrieves the chunks whose embeddings are closest to the question's embedding

This vector similarity search is what allows RAG systems to find relevant information without relying on exact keyword matching.

// Example (conceptual)
"I love cats"  [0.1, 0.2, 0.3, ..., 0.9]  // 1024 numbers
"I like cats"  [0.11, 0.19, 0.31, ..., 0.89]  // similar numbers
"I hate math"  [-0.5, 0.8, -0.2, ..., 0.1]  // very different numbers
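
"Closeness" here is usually measured with cosine similarity. Vector stores such as pgvector implement this for you; the self-contained TypeScript sketch below just shows the underlying math.

// Cosine similarity: 1 means same direction (very similar),
// 0 unrelated, -1 opposite. Retrieval ranks chunks by this score.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// In the conceptual example above, "I love cats" vs. "I like cats"
// would score close to 1, while "I hate math" would score much lower.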

Understanding Tokens

When implementing RAG, you'll frequently encounter the concept of "tokens," which play a crucial role in both chunking and embedding processes.

Token Basics

Tokens are the basic units that language models process. They don't always correspond directly to words:

  • "hello" → 1 token
  • "understanding" → 2 tokens ("under" + "standing")
  • "I love cats" → 3 tokens
  • "🌟" (emoji) → might be 1-2 tokens

The exact tokenization depends on the specific model and tokenizer being used.
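
When you only need a ballpark figure, a common rule of thumb is that one token is roughly four characters of English text. The sketch below encodes that heuristic; exact counts always require the tokenizer that matches your model.

// Rough heuristic only: ~4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

estimateTokens("I love cats"); // → 3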

Tokens in RAG Pipelines

Understanding tokens matters for several reasons in RAG systems:

  1. API Limits: Most embedding and LLM APIs have token limits per request

    • Example: If a limit is 8,192 tokens and your text is 10,000 tokens, you'll need to chunk it
  2. Cost Considerations: Many APIs charge per token processed

    • More tokens = higher costs
  3. Chunking Strategy: Token-aware chunking ensures you stay within API constraints (a sketch follows the table below)

    • Character-based chunking might create chunks that exceed token limits
When to Care About Tokens                  Recommendation
Using embedding APIs with strict limits    Implement token-counting in your chunking strategy
Cost-sensitive applications                Monitor token usage and optimize chunk sizes
Simple prototyping                         Start with character-based chunking; add token awareness if needed
High-volume production systems             Implement token-based chunking with overlap
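
Putting these pieces together, a token-aware chunker might greedily pack sentences until an estimated budget is reached. The 512-token default and the 4-characters-per-token estimate are assumptions for illustration; a production system would count tokens with the model's real tokenizer.

// Minimal sketch: pack sentences until an estimated token budget is hit.
function chunkByTokens(text: string, maxTokens = 512): string[] {
  const estimate = (t: string) => Math.ceil(t.length / 4); // rough heuristic
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? current + " " + sentence : sentence;
    if (current && estimate(candidate) > maxTokens) {
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}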

Implementation Considerations

When implementing chunking and embeddings in your RAG system, consider these practical tips:

  1. Chunk Size Trade-offs:

    • Too large: Less precise retrieval, may hit token limits
    • Too small: More storage overhead, might lose context, higher API costs
  2. Chunk Overlap:

    • Including some overlap between chunks (e.g., 10-20%) helps preserve context
    • Especially important for semantic chunking where concepts might span chunk boundaries
  3. Storage Architecture (a minimal record shape is sketched after this list):

    • Store not just embeddings but also:
      • Original document reference
      • Chunk metadata (position, source, etc.)
      • Raw text for context window insertion
  4. Database Selection:

    • Vector databases and extensions (like Postgres's pgvector) optimize similarity search
    • Consider hybrid approaches that combine vector and keyword search
  5. Embedding Model Selection:

    • Domain-specific vs. general-purpose
    • Dimensionality (higher isn't always better)
    • Throughput vs. quality trade-offs
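
As mentioned in point 3, one possible shape for a stored chunk record is sketched below in TypeScript; the field names are illustrative, not a fixed schema.

// Illustrative record shape for one stored chunk.
interface ChunkRecord {
  id: string;
  documentId: string;   // reference back to the original document
  position: number;     // chunk order within the document
  text: string;         // raw text, inserted into the context window
  embedding: number[];  // e.g., 1024 numbers from the embedding model
  metadata: {
    source: string;     // URL, file path, or other origin
    createdAt: string;
  };
}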

Conclusion

Chunking and embeddings form the foundation of effective RAG systems. By properly implementing these components, you can significantly improve the relevance, accuracy, and efficiency of your AI applications.

Remember:

  • Chunking is about breaking content into the right-sized pieces for your specific use case
  • Embeddings transform text into a format that machines can understand and compare
  • The right balance of chunk size, overlap, and embedding quality can dramatically improve retrieval performance

Whether you're building a custom knowledge base, enhancing customer support, or developing research tools, mastering these concepts will help you create more effective systems.
