RecursiveSplitter

datapizza.modules.splitters.RecursiveSplitter

Bases: Splitter

The RecursiveSplitter takes the leaf nodes of a document tree and groups them into Chunk objects, filling each chunk up to the maximum character limit (`max_char`). Each leaf Node represents the smallest unit of content that can be grouped.
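The grouping described above can be pictured with a small standalone sketch. It is not the library's implementation: the function `group_leaves` is hypothetical, leaves are modeled as plain strings rather than Node objects, and the rule that an oversized leaf becomes its own chunk is an assumption.

```python
def group_leaves(leaves: list[str], max_char: int) -> list[str]:
    """Greedily concatenate leaf texts, starting a new chunk whenever
    adding the next leaf would push the current chunk past max_char.

    Hypothetical helper for illustration only; not part of datapizza.
    """
    chunks: list[str] = []
    current = ""
    for leaf in leaves:
        # Close the current chunk if this leaf would overflow it.
        if current and len(current) + len(leaf) > max_char:
            chunks.append(current)
            current = ""
        # Assumption: a leaf longer than max_char is kept whole.
        current += leaf
    if current:
        chunks.append(current)
    return chunks


leaves = ["Alpha. ", "Beta. ", "Gamma. ", "Delta."]
print(group_leaves(leaves, max_char=14))
# ['Alpha. Beta. ', 'Gamma. Delta.']
```

Because the leaf is the smallest unit, this sketch never cuts inside a leaf; it only decides where one chunk ends and the next begins.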

__init__

__init__(max_char=5000, overlap=0)

Initialize the RecursiveSplitter.

Parameters:

Name	Type	Description	Default
`max_char`	`int`	The maximum number of characters per chunk	`5000`
`overlap`	`int`	The number of characters to overlap between chunks	`0`
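To make the `overlap` parameter concrete, here is a minimal sketch of one common way character overlap is applied: each chunk after the first is prefixed with the tail of the previous chunk. The helper `add_overlap` is hypothetical, and whether RecursiveSplitter builds its overlap this way, or counts the overlap against `max_char`, is not stated in this reference.

```python
def add_overlap(chunks: list[str], overlap: int) -> list[str]:
    """Prefix every chunk except the first with the last `overlap`
    characters of the chunk before it.

    Hypothetical helper for illustration only; not part of datapizza.
    """
    if overlap <= 0:
        return list(chunks)
    result = chunks[:1]
    for prev, cur in zip(chunks, chunks[1:]):
        # Repeat the tail of the previous chunk to preserve context.
        result.append(prev[-overlap:] + cur)
    return result


print(add_overlap(["abcde", "fghij", "klmno"], overlap=2))
# ['abcde', 'defghij', 'ijklmno']
```

In this model the overlap should be smaller than `max_char`; otherwise every chunk would mostly repeat its predecessor.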

split

split(node)

Split the node into chunks.

Parameters:

Name	Type	Description	Default
`node`	`Node`	The node to split	required

Returns:

Type	Description
`list[Chunk]`	A list of chunks

Usage

from datapizza.modules.parsers import TextParser
from datapizza.modules.splitters import RecursiveSplitter

splitter = RecursiveSplitter(
    max_char=10,
    overlap=1,
)

# Parse the text into nodes first, because RecursiveSplitter needs a Node
parser = TextParser()
document = parser.parse("""
This is the first section of the document.
It contains important information about the topic.

This is the second section with more details.
It provides additional context and examples.

The final section concludes the document.
It summarizes the key points discussed.
""")

chunks = splitter.split(document)
print(chunks)

Features

Uses multiple separator strategies in order of preference
Recursive approach favors chunk boundaries that fall on natural separators
Configurable chunk size and overlap for context preservation
Handles various content types with appropriate separator selection
Preserves content structure while maintaining size limits
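The "separators in order of preference" idea can be sketched independently of the library. The function `recursive_split` and its default separator list are assumptions made for illustration; the separators RecursiveSplitter actually uses are not documented here. The sketch tries the coarsest separator first and only falls back to a finer one for pieces that are still too long.

```python
def recursive_split(
    text: str,
    max_char: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split text into pieces of at most max_char characters, preferring
    earlier separators and recursing with finer ones only when needed.

    Hypothetical helper for illustration only; not part of datapizza.
    """
    if len(text) <= max_char:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + max_char] for i in range(0, len(text), max_char)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur; try the next one.
        return recursive_split(text, max_char, finer)

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_char:
            current = candidate  # piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_char:
            current = piece
        else:
            # Piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, max_char, finer))
    if current:
        chunks.append(current)
    return chunks


text = "one two\nthree four five\n\nsix"
print(recursive_split(text, max_char=10))
# ['one two', 'three four', 'five', 'six']
```

Trying paragraph breaks before line breaks before spaces keeps related text together where it fits, and the fixed-width fallback guarantees the size limit even for text with no separators at all.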