Chunking and Ingestion
Document chunking and ingestion are essential for building RAG (Retrieval-Augmented Generation) applications. Chunking breaks large documents into manageable pieces that fit within LLM context windows, while the ingestion pipeline orchestrates embedding and storage.
Overview
The chunking and ingestion system includes:
- Chunkers - Split text into pieces using different strategies
- IngestionPipeline - Orchestrates chunking, embedding, and storage
- Metadata Tracking - Maintains context and linking between chunks
- Progress Monitoring - Real-time updates during processing
Chunking Strategies
Choose the right chunker based on your document type and use case.
TextChunker
Simple character-based splitting with optional overlap. Best for uniform content like logs or transcripts.
```typescript
import { TextChunker } from '@agentionai/agents';

const chunker = new TextChunker({
  chunkSize: 1000, // Characters per chunk
  chunkOverlap: 200, // Character overlap between chunks
});

const chunks = await chunker.chunk(text, {
  sourceId: 'doc-123',
  sourcePath: '/docs/readme.md',
});

console.log(`Created ${chunks.length} chunks`);
```

Use when:
- Processing uniform-length text
- Overlap is important for context preservation
- You want predictable chunk sizes
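For intuition, fixed-size chunking with overlap advances a character window by `chunkSize - chunkOverlap` on each step, so with the settings above consecutive chunks share 200 characters. A minimal sketch of that arithmetic (illustrative only, not the library's implementation):

```typescript
// Illustrative only: a naive character window with overlap.
// Assumes chunkOverlap < chunkSize; TextChunker adds IDs, metadata,
// and processors on top of this basic idea.
function naiveCharChunks(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  const step = chunkSize - chunkOverlap; // window advances by 800 characters here
  const pieces: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    pieces.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return pieces;
}
```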
RecursiveChunker
Intelligent splitting on semantic boundaries (paragraphs → sentences → words). Best for structured documents like markdown or documentation.
```typescript
import { RecursiveChunker } from '@agentionai/agents';

const chunker = new RecursiveChunker({
  chunkSize: 1000,
  chunkOverlap: 100,
  separators: ['\n\n', '\n', '. ', ' '], // Tried in order
});

const chunks = await chunker.chunk(text, {
  sourceId: 'doc-123',
  metadata: { type: 'documentation' },
});
```

The chunker tries separators in order, falling back to smaller ones as needed:
- `\n\n` - Paragraphs (largest semantic unit)
- `\n` - Lines
- `. ` - Sentences
- ` ` - Words
- Character-based fallback
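To make the fallback concrete, here is a simplified sketch of the idea (not the library's code): split on the largest separator first, keep pieces that already fit, and recurse into oversized pieces with the remaining separators, ending in a character-based fallback.

```typescript
// Simplified illustration of recursive splitting with separator fallback.
// The real RecursiveChunker also handles overlap, metadata, and keeping separators.
function recursiveSplit(text: string, separators: string[], maxSize: number): string[] {
  if (text.length <= maxSize) return [text];
  const [sep, ...rest] = separators;
  if (!sep) {
    // No separators left: character-based fallback
    const out: string[] = [];
    for (let i = 0; i < text.length; i += maxSize) out.push(text.slice(i, i + maxSize));
    return out;
  }
  return text
    .split(sep)
    .filter((piece) => piece.length > 0)
    .flatMap((piece) =>
      piece.length <= maxSize ? [piece] : recursiveSplit(piece, rest, maxSize)
    );
}
```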
Use when:
- Processing markdown, articles, or documentation
- Semantic coherence is important
- Documents have clear structure
TokenChunker
Token-aware splitting using the tokenx library. Ensures chunks fit within LLM token limits with ~96% accuracy.
```typescript
import { TokenChunker } from '@agentionai/agents';

const chunker = new TokenChunker({
  chunkSize: 500, // Tokens per chunk (not characters)
  chunkOverlap: 50, // Token overlap
});

const chunks = await chunker.chunk(text);

// Each chunk includes token count in metadata
console.log(chunks[0].metadata.tokenCount); // e.g., 487
```

Use when:
- You have strict token budget constraints
- Working with multiple languages (token count varies)
- Precise LLM context management is critical
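When context management is the constraint, it helps to budget tokens explicitly before picking `chunkSize`. A back-of-the-envelope calculation in which every number is an assumption for illustration, not a library default:

```typescript
import { TokenChunker } from '@agentionai/agents';

// Hypothetical budget: all numbers below are assumptions for illustration.
const contextWindow = 8192;     // model context limit in tokens
const reservedForPrompt = 1000; // system prompt + user question
const reservedForAnswer = 1500; // room for the model's response
const topK = 10;                // retrieved chunks per query

const perChunkBudget = Math.floor((contextWindow - reservedForPrompt - reservedForAnswer) / topK);
console.log(perChunkBudget); // 569 -> a chunkSize of ~500 tokens leaves some headroom

const chunker = new TokenChunker({ chunkSize: 500, chunkOverlap: 50 });
```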
Chunk Metadata
Each chunk includes rich metadata for tracking and linking:
```typescript
interface ChunkMetadata {
  // Position & linking
  chunkIndex: number; // Position in sequence
  totalChunks: number; // Total chunk count
  previousChunkId: string | null; // Link to previous chunk
  nextChunkId: string | null; // Link to next chunk

  // Source tracking
  startOffset: number; // Character position in original text
  endOffset: number; // Character position in original text
  sourceId?: string; // Document identifier
  sourcePath?: string; // File path

  // Content info
  charCount: number; // Number of characters
  tokenCount?: number; // Estimated tokens (TokenChunker only)
  hash: string; // SHA-256 hash for deduplication

  // Structure
  sectionTitle?: string; // Detected section heading

  // Custom metadata
  [key: string]: unknown; // User-provided values
}
```

Chunk Processing
Apply transformations or filters to chunks after splitting:
```typescript
const chunker = new TextChunker({
  chunkSize: 500,
  chunkProcessor: async (chunk, index, allChunks) => {
    // Filter out very short chunks
    if (chunk.content.length < 50) {
      return null; // Skip this chunk
    }

    // Add custom metadata
    return {
      ...chunk,
      metadata: {
        ...chunk.metadata,
        wordCount: chunk.content.split(/\s+/).length,
        processedAt: new Date().toISOString(),
      },
    };
  },
});

const chunks = await chunker.chunk(text);
```

Processors can:
- Filter chunks based on content
- Add computed metadata
- Transform content (e.g., normalize whitespace)
- Return `null` to skip a chunk
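One of the points above, transforming content, is just as compact as the filter example. A minimal sketch that normalizes whitespace:

```typescript
import { TextChunker } from '@agentionai/agents';

const normalizingChunker = new TextChunker({
  chunkSize: 500,
  chunkProcessor: async (chunk) => ({
    ...chunk,
    // Collapse runs of whitespace and trim each chunk's content
    content: chunk.content.replace(/\s+/g, ' ').trim(),
  }),
});
```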
Ingestion Pipeline
The pipeline orchestrates the full workflow: chunk → embed → store.
Basic Ingestion
```typescript
import { IngestionPipeline, RecursiveChunker } from '@agentionai/agents';
import { OpenAIEmbeddings, LanceDBVectorStore } from '@agentionai/agents/vectorstore';

// Create pipeline components
const chunker = new RecursiveChunker({
  chunkSize: 1000,
  chunkOverlap: 100,
});

const embeddings = new OpenAIEmbeddings({
  model: 'text-embedding-3-small',
});

const store = await LanceDBVectorStore.create({
  name: 'my-documents',
  uri: './data/documents',
  tableName: 'chunks',
  embeddings,
});

// Create pipeline
const pipeline = new IngestionPipeline(chunker, embeddings, store);

// Ingest a document
const result = await pipeline.ingest(documentText, {
  sourceId: 'doc-001',
  sourcePath: '/docs/guide.md',
  batchSize: 50,
  onProgress: ({ phase, processed, total }) => {
    console.log(`${phase}: ${processed}/${total}`);
  },
});

console.log(`Stored ${result.chunksStored} chunks in ${result.duration}ms`);
```

Batch Ingestion
Process multiple documents efficiently:
```typescript
const documents = [
  {
    text: 'Document 1 content...',
    options: {
      sourceId: 'doc-1',
      metadata: { author: 'Alice' },
    },
  },
  {
    text: 'Document 2 content...',
    options: {
      sourceId: 'doc-2',
      metadata: { author: 'Bob' },
    },
  },
];

const result = await pipeline.ingestMany(documents, {
  batchSize: 100,
  onProgress: ({ phase, processed, total }) => {
    console.log(`${phase}: ${processed}/${total}`);
  },
});

console.log(`Total chunks stored: ${result.chunksStored}`);
```

Pre-chunked Data
If you've already chunked your data:
```typescript
const chunks = await chunker.chunk(text);

// Do custom processing or filtering...

const result = await pipeline.ingestChunks(chunks, {
  batchSize: 50,
});
```

Progress Monitoring
Track ingestion progress across three phases:
```typescript
const result = await pipeline.ingest(text, {
  onProgress: (event) => {
    console.log(`Phase: ${event.phase}`); // "chunking" | "embedding" | "storing"
    console.log(`Processed: ${event.processed}`); // Items done in this phase
    console.log(`Total: ${event.total}`); // Total items in this phase
    console.log(`Batch: ${event.currentBatch}/${event.totalBatches}`); // For batch phases

    // Update UI progress bar
    const progress = (event.processed / event.total) * 100;
    updateProgressBar(progress);
  },
});
```

Phases:
- chunking - Text is split into chunks
- embedding - Chunks are embedded in batches
- storing - Embeddings are stored in the vector database
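The `updateProgressBar` call in the example above is a placeholder for whatever UI you drive. A hypothetical console-based implementation could look like this:

```typescript
// Hypothetical helper: renders a simple text progress bar to the console.
function updateProgressBar(percent: number, width = 30): void {
  const filled = Math.round((percent / 100) * width);
  const bar = '#'.repeat(filled) + '-'.repeat(width - filled);
  process.stdout.write(`\r[${bar}] ${percent.toFixed(0)}%`);
}
```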
Error Handling
Control how errors are handled during ingestion:
```typescript
const result = await pipeline.ingest(text, {
  onError: (error, chunk) => {
    console.error(`Error on chunk ${chunk.id}:`, error.message);

    // Return 'skip' to continue with next chunk
    // Return 'abort' to stop entire ingestion
    if (error.message.includes('rate limit')) {
      return 'skip'; // Skip rate-limited chunks
    } else {
      return 'abort'; // Stop on other errors
    }
  },
});

// Check for errors in result
if (!result.success) {
  console.log(`Ingestion aborted. Errors: ${result.errors.length}`);
}

result.errors.forEach(({ chunk, error }) => {
  console.error(`Failed: ${chunk.id} - ${error.message}`);
});
```

Ingestion Result
The pipeline returns detailed metrics:
```typescript
interface IngestionResult {
  success: boolean; // Completed without abort
  chunksProcessed: number; // Total chunks created
  chunksSkipped: number; // Duplicates or filtered
  chunksStored: number; // Successfully stored
  errors: Array<{ // Errors encountered
    chunk: Chunk;
    error: Error;
  }>;
  duration: number; // Total time in ms
}
```

Duplicate Detection
Skip chunks that already exist in the store:
```typescript
const result = await pipeline.ingest(text, {
  skipDuplicates: true, // Enable duplicate detection
});

console.log(`Skipped ${result.chunksSkipped} duplicate chunks`);
```

Note: Requires the vector store to support hash-based lookup.
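Duplicate detection keys on the SHA-256 content hash carried in each chunk's metadata (see ChunkMetadata above). As a sketch of the idea, assuming Node's built-in crypto module; the store's actual lookup mechanism may differ:

```typescript
import { createHash } from 'node:crypto';

// Illustrative only: hash chunk content the same way a dedup lookup would key it.
function contentHash(content: string): string {
  return createHash('sha256').update(content).digest('hex');
}

// Two chunks with identical content produce identical hashes, so a store
// that indexes metadata.hash can skip re-embedding and re-storing them.
console.log(contentHash('same text') === contentHash('same text')); // true
```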
Custom ID Generation
Control how chunk IDs are generated:
```typescript
const chunker = new TextChunker({
  chunkSize: 500,
  idGenerator: (content, index, sourceId) => {
    // Generate custom IDs
    const timestamp = Date.now();
    return `${sourceId}-${timestamp}-${index}`;
  },
});
```

Advanced: Custom Chunking
Implement your own chunker by extending the base class:
```typescript
import { Chunker, ChunkerConfig } from '@agentionai/agents/chunking';

class MyChunker extends Chunker {
  readonly name = 'MyChunker';

  protected splitText(text: string): string[] {
    // Implement your splitting logic, e.g. split on blank lines
    return text.split(/\n{2,}/).filter((piece) => piece.trim().length > 0);
  }
}

const chunker = new MyChunker({ chunkSize: 1000 });
```

Best Practices
- Choose the right chunker - TextChunker for uniform data, RecursiveChunker for structured docs, TokenChunker for LLM constraints
- Set appropriate overlap - 10-20% overlap helps with context preservation
- Monitor progress - Use callbacks for user feedback and debugging
- Handle errors gracefully - Decide whether to skip or abort on errors
- Track source information - Include `sourceId` and `sourcePath` for traceability
- Use batch processing - Larger batches are more efficient but use more memory
- Add custom metadata - Include document type, author, timestamp, etc. for filtering
- Test chunk size - Different content types may need different sizes
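For the last point, a quick experiment comparing chunk counts at a few candidate sizes (the sizes here are arbitrary examples) can guide the choice:

```typescript
import { RecursiveChunker } from '@agentionai/agents';

// Compare how a few candidate sizes split the same document.
for (const chunkSize of [500, 1000, 2000]) {
  const chunker = new RecursiveChunker({
    chunkSize,
    chunkOverlap: Math.round(chunkSize * 0.1), // ~10% overlap, per the guidance above
  });
  const chunks = await chunker.chunk(documentText);
  console.log(`chunkSize=${chunkSize}: ${chunks.length} chunks`);
}
```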
Comparison
| Feature | TextChunker | RecursiveChunker | TokenChunker |
|---|---|---|---|
| Speed | Very fast | Fast | Fast |
| Semantic awareness | No | Yes | No |
| Token aware | No | No | Yes |
| Best for | Logs, transcripts | Markdown, documentation | LLM context limits |
| Complexity | Low | Medium | Medium |
Examples
See the complete example implementation:
```bash
npm run example -- examples/ingestion-pipeline.ts
```

This demonstrates:
- All three chunker types
- Custom chunk processors
- Full ingestion pipeline with vector storage
- Batch document ingestion
- Search on ingested documents