Chunking and Ingestion
Document chunking and ingestion are essential for building RAG (Retrieval-Augmented Generation) applications. Chunking breaks large documents into manageable pieces that fit within LLM context windows, while the ingestion pipeline orchestrates embedding and storage.
Overview
The chunking and ingestion system includes:
- Chunkers - Split text into pieces using different strategies
- IngestionPipeline - Orchestrates chunking, embedding, and storage
- Metadata Tracking - Maintains context and linking between chunks
- Progress Monitoring - Real-time updates during processing
Chunking Strategies
Choose the right chunker based on your document type and use case.
TextChunker
Simple character-based splitting with optional overlap. Best for uniform content like logs or transcripts.
```typescript
import { TextChunker } from '@agentionai/agents';

const chunker = new TextChunker({
  chunkSize: 1000, // Characters per chunk
  chunkOverlap: 200, // Character overlap between chunks
});

const chunks = await chunker.chunk(text, {
  sourceId: 'doc-123',
  sourcePath: '/docs/readme.md',
});

console.log(`Created ${chunks.length} chunks`);
```

Use when:
- Processing uniform-length text
- Overlap is important for context preservation
- You want predictable chunk sizes
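For intuition, fixed-size chunking with overlap advances a character window by `chunkSize - chunkOverlap` on each step, so with the settings above consecutive chunks share 200 characters. A minimal sketch of that arithmetic (illustrative only, not the library's implementation):

```typescript
// Illustrative only: a naive character window with overlap.
// Assumes chunkOverlap < chunkSize; TextChunker adds IDs, metadata,
// and processors on top of this basic idea.
function naiveCharChunks(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  const step = chunkSize - chunkOverlap; // window advances by 800 characters here
  const pieces: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    pieces.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return pieces;
}
```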
RecursiveChunker
Intelligent splitting on semantic boundaries (paragraphs → sentences → words). Best for structured documents like markdown or documentation.
```typescript
import { RecursiveChunker } from '@agentionai/agents';

const chunker = new RecursiveChunker({
  chunkSize: 1000,
  chunkOverlap: 100,
  separators: ['\n\n', '\n', '. ', ' '], // Tried in order
});

const chunks = await chunker.chunk(text, {
  sourceId: 'doc-123',
  metadata: { type: 'documentation' },
});
```

The chunker tries separators in order, falling back to smaller ones as needed:
- `\n\n` - Paragraphs (largest semantic unit)
- `\n` - Lines
- `. ` - Sentences
- ` ` - Words
- Character-based fallback
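To make the fallback concrete, here is a simplified sketch of the idea (not the library's code): split on the largest separator first, keep pieces that already fit, and recurse into oversized pieces with the remaining separators, ending in a character-based fallback.

```typescript
// Simplified illustration of recursive splitting with separator fallback.
// The real RecursiveChunker also handles overlap, metadata, and keeping separators.
function recursiveSplit(text: string, separators: string[], maxSize: number): string[] {
  if (text.length <= maxSize) return [text];
  const [sep, ...rest] = separators;
  if (!sep) {
    // No separators left: character-based fallback
    const out: string[] = [];
    for (let i = 0; i < text.length; i += maxSize) out.push(text.slice(i, i + maxSize));
    return out;
  }
  return text
    .split(sep)
    .filter((piece) => piece.length > 0)
    .flatMap((piece) =>
      piece.length <= maxSize ? [piece] : recursiveSplit(piece, rest, maxSize)
    );
}
```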
Use when:
- Processing markdown, articles, or documentation
- Semantic coherence is important
- Documents have clear structure
TokenChunker
Token-aware splitting using the tokenx library. Ensures chunks fit within LLM token limits with ~96% accuracy.
```typescript
import { TokenChunker } from '@agentionai/agents';

const chunker = new TokenChunker({
  chunkSize: 500, // Tokens per chunk (not characters)
  chunkOverlap: 50, // Token overlap
});

const chunks = await chunker.chunk(text);

// Each chunk includes token count in metadata
console.log(chunks[0].metadata.tokenCount); // e.g., 487
```

Use when:
- You have strict token budget constraints
- Working with multiple languages (token count varies)
- Precise LLM context management is critical
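When context management is the constraint, it helps to budget tokens explicitly before picking `chunkSize`. A back-of-the-envelope calculation in which every number is an assumption for illustration, not a library default:

```typescript
import { TokenChunker } from '@agentionai/agents';

// Hypothetical budget: all numbers below are assumptions for illustration.
const contextWindow = 8192;     // model context limit in tokens
const reservedForPrompt = 1000; // system prompt + user question
const reservedForAnswer = 1500; // room for the model's response
const topK = 10;                // retrieved chunks per query

const perChunkBudget = Math.floor((contextWindow - reservedForPrompt - reservedForAnswer) / topK);
console.log(perChunkBudget); // 569 -> a chunkSize of ~500 tokens leaves some headroom

const chunker = new TokenChunker({ chunkSize: 500, chunkOverlap: 50 });
```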
Chunk Metadata
Each chunk includes rich metadata for tracking and linking:
```typescript
interface ChunkMetadata {
  // Position & linking
  chunkIndex: number; // Position in sequence
  totalChunks: number; // Total chunk count
  previousChunkId: string | null; // Link to previous chunk
  nextChunkId: string | null; // Link to next chunk

  // Source tracking
  startOffset: number; // Character position in original text
  endOffset: number; // Character position in original text
  sourceId?: string; // Document identifier
  sourcePath?: string; // File path

  // Content info
  charCount: number; // Number of characters
  tokenCount?: number; // Estimated tokens (TokenChunker only)
  hash: string; // SHA-256 hash for deduplication

  // Structure
  sectionTitle?: string; // Detected section heading

  // Custom metadata
  [key: string]: unknown; // User-provided values
}
```

Chunk Processing
Apply transformations or filters to chunks after splitting:
```typescript
const chunker = new TextChunker({
  chunkSize: 500,
  chunkProcessor: async (chunk, index, allChunks) => {
    // Filter out very short chunks
    if (chunk.content.length < 50) {
      return null; // Skip this chunk
    }

    // Add custom metadata
    return {
      ...chunk,
      metadata: {
        ...chunk.metadata,
        wordCount: chunk.content.split(/\s+/).length,
        processedAt: new Date().toISOString(),
      },
    };
  },
});

const chunks = await chunker.chunk(text);
```

Processors can:
- Filter chunks based on content
- Add computed metadata
- Transform content (e.g., normalize whitespace)
- Return `null` to skip a chunk
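One of the points above, transforming content, is just as compact as the filter example. A minimal sketch that normalizes whitespace:

```typescript
import { TextChunker } from '@agentionai/agents';

const normalizingChunker = new TextChunker({
  chunkSize: 500,
  chunkProcessor: async (chunk) => ({
    ...chunk,
    // Collapse runs of whitespace and trim each chunk's content
    content: chunk.content.replace(/\s+/g, ' ').trim(),
  }),
});
```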
Ingestion Pipeline
The pipeline orchestrates the full workflow: chunk → embed → store.
Basic Ingestion
```typescript
import { IngestionPipeline, RecursiveChunker } from '@agentionai/agents';
import { OpenAIEmbeddings, LanceDBVectorStore } from '@agentionai/agents/vectorstore';

// Create pipeline components
const chunker = new RecursiveChunker({
  chunkSize: 1000,
  chunkOverlap: 100,
});

const embeddings = new OpenAIEmbeddings({
  model: 'text-embedding-3-small',
});

const store = await LanceDBVectorStore.create({
  name: 'my-documents',
  uri: './data/documents',
  tableName: 'chunks',
  embeddings,
});

// Create pipeline
const pipeline = new IngestionPipeline(chunker, embeddings, store);

// Ingest a document
const result = await pipeline.ingest(documentText, {
  sourceId: 'doc-001',
  sourcePath: '/docs/guide.md',
  batchSize: 50,
  onProgress: ({ phase, processed, total }) => {
    console.log(`${phase}: ${processed}/${total}`);
  },
});

console.log(`Stored ${result.chunksStored} chunks in ${result.duration}ms`);
```

Batch Ingestion
Process multiple documents efficiently:
```typescript
const documents = [
  {
    text: 'Document 1 content...',
    options: {
      sourceId: 'doc-1',
      metadata: { author: 'Alice' },
    },
  },
  {
    text: 'Document 2 content...',
    options: {
      sourceId: 'doc-2',
      metadata: { author: 'Bob' },
    },
  },
];

const result = await pipeline.ingestMany(documents, {
  batchSize: 100,
  onProgress: ({ phase, processed, total }) => {
    console.log(`${phase}: ${processed}/${total}`);
  },
});

console.log(`Total chunks stored: ${result.chunksStored}`);
```

Pre-chunked Data
If you've already chunked your data:
```typescript
const chunks = await chunker.chunk(text);

// Do custom processing or filtering...

const result = await pipeline.ingestChunks(chunks, {
  batchSize: 50,
});
```

Progress Monitoring
Track ingestion progress across three phases:
```typescript
const result = await pipeline.ingest(text, {
  onProgress: (event) => {
    console.log(`Phase: ${event.phase}`); // "chunking" | "embedding" | "storing"
    console.log(`Processed: ${event.processed}`); // Items done in this phase
    console.log(`Total: ${event.total}`); // Total items in this phase
    console.log(`Batch: ${event.currentBatch}/${event.totalBatches}`); // For batch phases

    // Update UI progress bar
    const progress = (event.processed / event.total) * 100;
    updateProgressBar(progress);
  },
});
```

Phases:
- chunking - Text is split into chunks
- embedding - Chunks are embedded in batches
- storing - Embeddings are stored in the vector database
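The `updateProgressBar` call in the example above is a placeholder for whatever UI you drive. A hypothetical console-based implementation could look like this:

```typescript
// Hypothetical helper: renders a simple text progress bar to the console.
function updateProgressBar(percent: number, width = 30): void {
  const filled = Math.round((percent / 100) * width);
  const bar = '#'.repeat(filled) + '-'.repeat(width - filled);
  process.stdout.write(`\r[${bar}] ${percent.toFixed(0)}%`);
}
```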
Error Handling
Control how errors are handled during ingestion:
```typescript
const result = await pipeline.ingest(text, {
  onError: (error, chunk) => {
    console.error(`Error on chunk ${chunk.id}:`, error.message);

    // Return 'skip' to continue with next chunk
    // Return 'abort' to stop entire ingestion
    if (error.message.includes('rate limit')) {
      return 'skip'; // Skip rate-limited chunks
    } else {
      return 'abort'; // Stop on other errors
    }
  },
});

// Check for errors in result
if (!result.success) {
  console.log(`Ingestion aborted. Errors: ${result.errors.length}`);
}

result.errors.forEach(({ chunk, error }) => {
  console.error(`Failed: ${chunk.id} - ${error.message}`);
});
```

Ingestion Result
The pipeline returns detailed metrics:
```typescript
interface IngestionResult {
  success: boolean; // Completed without abort
  chunksProcessed: number; // Total chunks created
  chunksSkipped: number; // Duplicates or filtered
  chunksStored: number; // Successfully stored
  errors: Array<{ // Errors encountered
    chunk: Chunk;
    error: Error;
  }>;
  duration: number; // Total time in ms
}
```

Duplicate Detection
Skip chunks that already exist in the store:
```typescript
const result = await pipeline.ingest(text, {
  skipDuplicates: true, // Enable duplicate detection
});

console.log(`Skipped ${result.chunksSkipped} duplicate chunks`);
```

Note: Requires the vector store to support hash-based lookup.
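Duplicate detection keys on the SHA-256 content hash carried in each chunk's metadata (see ChunkMetadata above). As a sketch of the idea, assuming Node's built-in crypto module; the store's actual lookup mechanism may differ:

```typescript
import { createHash } from 'node:crypto';

// Illustrative only: hash chunk content the same way a dedup lookup would key it.
function contentHash(content: string): string {
  return createHash('sha256').update(content).digest('hex');
}

// Two chunks with identical content produce identical hashes, so a store
// that indexes metadata.hash can skip re-embedding and re-storing them.
console.log(contentHash('same text') === contentHash('same text')); // true
```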
Custom ID Generation
Control how chunk IDs are generated:
```typescript
const chunker = new TextChunker({
  chunkSize: 500,
  idGenerator: (content, index, sourceId) => {
    // Generate custom IDs
    const timestamp = Date.now();
    return `${sourceId}-${timestamp}-${index}`;
  },
});
```

Advanced: Custom Chunking
Implement your own chunker by extending the base class:
```typescript
import { Chunker, ChunkerConfig } from '@agentionai/agents/chunking';

class MyChunker extends Chunker {
  readonly name = 'MyChunker';

  protected splitText(text: string): string[] {
    // Implement your splitting logic, e.g. split on blank lines
    return text.split(/\n{2,}/).filter((piece) => piece.trim().length > 0);
  }
}

const chunker = new MyChunker({ chunkSize: 1000 });
```

Best Practices
- Choose the right chunker - TextChunker for uniform data, RecursiveChunker for structured docs, TokenChunker for LLM constraints
- Set appropriate overlap - 10-20% overlap helps with context preservation
- Monitor progress - Use callbacks for user feedback and debugging
- Handle errors gracefully - Decide whether to skip or abort on errors
- Track source information - Include `sourceId` and `sourcePath` for traceability
- Use batch processing - Larger batches are more efficient but use more memory
- Add custom metadata - Include document type, author, timestamp, etc. for filtering
- Test chunk size - Different content types may need different sizes
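For the last point, a quick experiment comparing chunk counts at a few candidate sizes (the sizes here are arbitrary examples) can guide the choice:

```typescript
import { RecursiveChunker } from '@agentionai/agents';

// Compare how a few candidate sizes split the same document.
for (const chunkSize of [500, 1000, 2000]) {
  const chunker = new RecursiveChunker({
    chunkSize,
    chunkOverlap: Math.round(chunkSize * 0.1), // ~10% overlap, per the guidance above
  });
  const chunks = await chunker.chunk(documentText);
  console.log(`chunkSize=${chunkSize}: ${chunks.length} chunks`);
}
```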
Comparison
| Feature | TextChunker | RecursiveChunker | TokenChunker |
|---|---|---|---|
| Speed | Very fast | Fast | Fast |
| Semantic awareness | No | Yes | No |
| Token aware | No | No | Yes |
| Best for | Logs, transcripts | Markdown, documentation | LLM context limits |
| Complexity | Low | Medium | Medium |
Examples
See the complete example implementation:
```bash
npm run example -- examples/ingestion-pipeline.ts
```

This demonstrates:
- All three chunker types
- Custom chunk processors
- Full ingestion pipeline with vector storage
- Batch document ingestion
- Search on ingested documents