Splitters

Splitters are pipeline components that divide large text content into smaller, manageable chunks. They help optimize content for processing, storage, and retrieval in AI applications by creating appropriately sized segments while preserving context and meaning.

Installation

All splitters are included with datapizza-ai-core and require no additional installation.
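
To confirm that the splitters are importable in the current environment, a small standard-library check like the one below can help. This is only a sketch: the helper name `splitters_available` is hypothetical, and the module path is taken from the import statements used later on this page.

```python
import importlib.util


def splitters_available() -> bool:
    """Return True if datapizza.modules.splitters can be located.

    Hypothetical helper, not part of datapizza-ai-core.
    """
    try:
        # find_spec imports the parent packages, so a missing install
        # surfaces as ImportError / ModuleNotFoundError here.
        return importlib.util.find_spec("datapizza.modules.splitters") is not None
    except ImportError:
        return False


if __name__ == "__main__":
    print("splitters importable:", splitters_available())
```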

Available Splitters

Core Splitters (Included by Default)

  • RecursiveSplitter - Recursively divides text using multiple splitting strategies
  • TextSplitter - Basic text splitter for general-purpose chunking
  • NodeSplitter - Splitter for Node objects preserving hierarchical structure
  • PDFImageSplitter - Specialized splitter for PDF content with images
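
The idea behind recursive splitting can be illustrated in plain Python. The sketch below is not the RecursiveSplitter implementation. The function name `recursive_split`, the `max_chars` parameter, and the separator order are assumptions chosen for illustration. It tries the coarsest separator first and falls back to finer ones only for pieces that are still too long.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most max_chars characters.

    Illustrative only: tries coarse separators first, then finer ones,
    and finally hard character cuts when no separator is left.
    """
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width cuts.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            # The part still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # The part alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "aaa bbb ccc\n\nddd eee fff"
    for piece in recursive_split(sample, 7):
        print(repr(piece))
```

Trying paragraph breaks before line breaks before spaces keeps related sentences together whenever the size limit allows it.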

Common Features

  • Multiple splitting strategies for different content types
  • Configurable chunk sizes and overlap
  • Context preservation through overlapping
  • Support for structured content (nodes, PDFs, etc.)
  • Metadata preservation during splitting
  • Spatial layout awareness for document content
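
To make chunk size and overlap concrete, here is a minimal character-based sketch. The function `chunk_with_overlap` is hypothetical and does not reflect how the library's splitters are implemented. It only shows how consecutive chunks share their boundary text when an overlap is configured.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Illustrative sketch only; sizes are counted in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(chunk)
```

With a chunk size of 4 and an overlap of 2, each chunk repeats the last two characters of the previous one, which is how overlapping keeps context available across chunk boundaries.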

Usage Patterns

Basic Text Splitting

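from datapizza.modules.splitters import RecursiveSplitter

# long_text_content is a placeholder for the string you want to split.
# chunk_size sets the target chunk size; chunk_overlap is the amount
# shared between adjacent chunks.
splitter = RecursiveSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter(long_text_content)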

Document Processing Pipeline

from datapizza.modules.parsers import TextParser
from datapizza.modules.splitters import NodeSplitter

parser = TextParser()
splitter = NodeSplitter(chunk_size=800, preserve_structure=True)

# text_content is a placeholder for the raw text to parse.
# Parse it into a structured document first, then split that document.
document = parser.parse(text_content)
structured_chunks = splitter(document)

Choosing the Right Splitter

  • RecursiveSplitter: Best for general text content, articles, and most use cases
  • TextSplitter: Simple splitting for basic text without complex requirements
  • NodeSplitter: When working with structured Node objects from parsers
  • PDFImageSplitter: Specifically for PDF content with images and complex layouts
  • BBoxMerger: Utility for processing documents with spatial layout information