RecursiveSplitter

datapizza.modules.splitters.RecursiveSplitter

Bases: Splitter

The RecursiveSplitter takes leaf nodes from a tree document structure and groups them into Chunk objects until reaching the maximum character limit. Each leaf Node represents the smallest unit of content that can be grouped.
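The grouping described above can be pictured as a greedy packing loop. The sketch below is only an illustration under stated assumptions, not the library's implementation: `group_leaves` is a hypothetical helper, leaves are plain strings rather than Node objects, and pieces are joined with a single space.

```python
def group_leaves(leaves: list[str], max_char: int) -> list[str]:
    """Greedily pack leaf texts into chunks of at most max_char characters.

    Hypothetical helper for illustration; a leaf longer than max_char is
    kept whole in its own chunk rather than being cut.
    """
    chunks: list[str] = []
    current = ""
    for leaf in leaves:
        # Text the current chunk would hold if this leaf were added to it.
        candidate = leaf if not current else current + " " + leaf
        if not current or len(candidate) <= max_char:
            current = candidate
        else:
            # The leaf does not fit: close the current chunk, start a new one.
            chunks.append(current)
            current = leaf
    if current:
        chunks.append(current)
    return chunks


packed = group_leaves(["aaaa", "bbbb", "cccc"], max_char=9)
print(packed)
```

Here the first two leaves fit together within nine characters, and the third starts a new chunk.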

__init__

__init__(max_char=5000, overlap=0)

Initialize the RecursiveSplitter.

Parameters:

Name      Type  Description                                         Default
max_char  int   The maximum number of characters per chunk          5000
overlap   int   The number of characters to overlap between chunks  0
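To make the `overlap` parameter concrete, here is a minimal sketch of one common way to apply character overlap: each chunk after the first is prefixed with the last `overlap` characters of the previous chunk. The `add_overlap` helper is hypothetical, and the way RecursiveSplitter applies overlap internally may differ.

```python
def add_overlap(chunks: list[str], overlap: int) -> list[str]:
    """Prefix each chunk (except the first) with the tail of the previous one.

    Hypothetical helper for illustration. In this sketch the overlap is
    added on top of the chunk, so a result can exceed the original size limit.
    """
    if overlap <= 0:
        return list(chunks)
    result = chunks[:1]
    for previous, current in zip(chunks, chunks[1:]):
        # Repeat the last `overlap` characters of the previous chunk.
        result.append(previous[-overlap:] + current)
    return result


print(add_overlap(["abcdef", "ghijkl"], overlap=2))
```

Repeating a little text across the boundary helps a downstream reader or retriever keep context that would otherwise be cut at the chunk edge.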

split

split(node)

Split the node into chunks.

Parameters:

Name  Type  Description        Default
node  Node  The node to split  required

Returns:

Type         Description
list[Chunk]  A list of chunks

Usage

from datapizza.modules.parsers import TextParser
from datapizza.modules.splitters import RecursiveSplitter

splitter = RecursiveSplitter(
    max_char=10,
    overlap=1,
)

# Parse the text into nodes first, because RecursiveSplitter operates on a Node
parser = TextParser()
document = parser.parse("""
This is the first section of the document.
It contains important information about the topic.

This is the second section with more details.
It provides additional context and examples.

The final section concludes the document.
It summarizes the key points discussed.
""")

chunks = splitter.split(document)
print(chunks)
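After splitting, it can be useful to check how large the resulting chunks are. The sketch below assumes that each Chunk exposes its content through a `text` attribute; that attribute name is an assumption here, and `FakeChunk`, `chunk_lengths`, and `oversized` are hypothetical names used only so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class FakeChunk:
    """Stand-in for a Chunk; assumes the real class has a `text` attribute."""
    text: str


def chunk_lengths(chunks) -> list[int]:
    """Return the character length of each chunk's text."""
    # Fall back to str(chunk) if there is no `text` attribute.
    return [len(getattr(chunk, "text", str(chunk))) for chunk in chunks]


def oversized(chunks, max_char: int) -> list[int]:
    """Return the indexes of chunks whose text is longer than max_char."""
    return [i for i, n in enumerate(chunk_lengths(chunks)) if n > max_char]


sample = [FakeChunk("short"), FakeChunk("a much longer piece of text")]
print(chunk_lengths(sample))
print(oversized(sample, max_char=10))
```

With real output, `sample` would be replaced by the list returned from `splitter.split(document)`.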

Features

  • Uses multiple separator strategies in order of preference
  • Recursive approach tries coarser boundaries first and falls back to finer ones only when a piece is still too large
  • Configurable chunk size and overlap for context preservation
  • Handles various content types with appropriate separator selection
  • Preserves content structure while maintaining size limits
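The idea of trying separators in order of preference can be sketched generically as follows. This is an illustration of the general recursive-splitting technique, not the library's code: the function name `recursive_split` and the separator list (blank line, newline, space) are assumptions chosen for the example.

```python
def recursive_split(
    text: str,
    max_char: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split text into pieces of at most max_char characters.

    Separators are tried in order of preference; a piece that is still too
    long is split again with the next separator. When no separators remain,
    the text is cut at fixed positions as a last resort.
    """
    if len(text) <= max_char:
        return [text] if text else []
    if not separators:
        # Last resort: hard cut every max_char characters.
        return [text[i:i + max_char] for i in range(0, len(text), max_char)]
    first, rest = separators[0], separators[1:]
    pieces: list[str] = []
    for part in text.split(first):
        pieces.extend(recursive_split(part, max_char, rest))
    return pieces


print(recursive_split("aaa bbb\n\nccc", max_char=5))
```

This sketch only splits. A full splitter would typically regroup small neighbouring pieces up to the size limit so that chunks are not needlessly short.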