RecursiveSplitter
datapizza.modules.splitters.RecursiveSplitter
Bases: Splitter
The RecursiveSplitter takes the leaf nodes of a document tree and groups them into Chunk objects, adding leaves to a chunk until it reaches the maximum character limit (max_char). Each leaf Node is the smallest unit of content that can be grouped.
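The grouping idea can be pictured with a short standalone sketch. This is not datapizza's implementation: `group_leaves` is a hypothetical helper that works on plain strings rather than Node and Chunk objects, and joining leaves with a single space is an assumption made for illustration.

```python
def group_leaves(leaves: list[str], max_char: int) -> list[str]:
    """Greedily pack leaf texts into chunks of at most max_char characters.

    Hypothetical helper for illustration only. A leaf that is longer than
    max_char on its own is kept whole in a chunk of its own, not cut.
    """
    chunks: list[str] = []
    current = ""
    for leaf in leaves:
        # Assumption: leaves are joined with a single space.
        candidate = f"{current} {leaf}" if current else leaf
        if current and len(candidate) > max_char:
            # Adding this leaf would exceed the limit: close the chunk.
            chunks.append(current)
            current = leaf
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


leaves = ["First sentence.", "Second one.", "Third sentence here."]
chunks = group_leaves(leaves, max_char=30)
print(chunks)
```

In this sketch the first two leaves fit together under the limit, while the third starts a new chunk.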
__init__
Initialize the RecursiveSplitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_char | int | The maximum number of characters per chunk | 5000 |
overlap | int | The number of characters to overlap between chunks | 0 |
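To make the overlap parameter concrete, here is a minimal sketch of one common way character overlap is applied between consecutive chunks. `add_overlap` is a hypothetical helper operating on plain strings. Whether RecursiveSplitter copies the overlap from the end of the previous chunk in exactly this way, and whether the overlap counts toward max_char, is not stated here, so both are assumptions.

```python
def add_overlap(chunks: list[str], overlap: int) -> list[str]:
    """Prefix every chunk except the first with the tail of the previous one.

    Hypothetical helper: the last `overlap` characters of chunk i are
    repeated at the start of chunk i + 1.
    """
    if overlap <= 0:
        return list(chunks)
    result = chunks[:1]
    for previous, current in zip(chunks, chunks[1:]):
        result.append(previous[-overlap:] + current)
    return result


plain_chunks = ["abcdef", "ghijkl", "mnop"]
overlapped = add_overlap(plain_chunks, overlap=2)
print(overlapped)
```

With the default of 0 the chunks are returned unchanged; a larger value repeats more trailing context at the start of each following chunk.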
Usage
from datapizza.modules.parsers import TextParser
from datapizza.modules.splitters import RecursiveSplitter
splitter = RecursiveSplitter(
max_char=10,
overlap=1,
)
# Parse the text into a Node tree, because RecursiveSplitter operates on Node objects
parser = TextParser()
document = parser.parse("""
This is the first section of the document.
It contains important information about the topic.
This is the second section with more details.
It provides additional context and examples.
The final section concludes the document.
It summarizes the key points discussed.
""")
chunks = splitter.split(document)
print(chunks)
Features
- Uses multiple separator strategies in order of preference
- Recursive approach ensures optimal chunk boundaries
- Configurable chunk size and overlap for context preservation
- Handles various content types with appropriate separator selection
- Preserves content structure while maintaining size limits
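The idea of trying separators in order of preference can be sketched as a generic recursive routine. This illustrates the general technique and is not the library's code: `recursive_split` and its default separator list are hypothetical, and the actual separators used by RecursiveSplitter are not documented in this section.

```python
def recursive_split(
    text: str,
    max_char: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split text into pieces of at most max_char characters.

    Hypothetical sketch: try the coarsest separator first and fall back
    to finer ones only for pieces that are still too long.
    """
    if len(text) <= max_char:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every max_char characters.
        return [text[i:i + max_char] for i in range(0, len(text), max_char)]
    separator, finer = separators[0], separators[1:]
    parts = text.split(separator)
    if len(parts) == 1:
        # Separator not present: move on to the next one.
        return recursive_split(text, max_char, finer)
    pieces: list[str] = []
    for part in parts:
        pieces.extend(recursive_split(part, max_char, finer))
    return pieces


pieces = recursive_split("alpha beta\n\ngamma delta epsilon", max_char=12)
print(pieces)
```

The sketch only produces pieces that fit the limit and drops the separators themselves; a full splitter would typically pack small neighbouring pieces back together up to max_char.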