
NodeSplitter

datapizza.modules.splitters.NodeSplitter

Bases: Splitter

A splitter that traverses a document tree starting from the root node. If the root node's content is smaller than max_char, the whole document becomes a single chunk. Otherwise the splitter recurses into the node's children: each child that fits within max_char becomes a chunk, and any child that does not fit is processed the same way one level deeper, continuing down the tree as needed.
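
The traversal described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the library's implementation: `SketchNode` and `split_tree` are hypothetical names, a node's content is assumed to be the concatenation of its children's text, a node that is exactly max_char long is treated as fitting, and a leaf that is still too large is emitted unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class SketchNode:
    """Hypothetical stand-in for a document tree node (not the library's Node)."""
    text: str = ""
    children: list["SketchNode"] = field(default_factory=list)

    @property
    def content(self) -> str:
        # Assumption: an inner node's content is its children's content joined together.
        if self.children:
            return "".join(child.content for child in self.children)
        return self.text


def split_tree(node: SketchNode, max_char: int = 5000) -> list[str]:
    """Return chunk texts, descending only into nodes that are too large."""
    content = node.content
    # A node that fits, or a leaf that cannot be divided further, becomes one chunk.
    if len(content) <= max_char or not node.children:
        return [content]
    chunks: list[str] = []
    for child in node.children:
        chunks.extend(split_tree(child, max_char))
    return chunks


# A 160-character document: one section with two 60-character paragraphs,
# followed by a 40-character section.
doc = SketchNode(children=[
    SketchNode(children=[SketchNode(text="a" * 60), SketchNode(text="b" * 60)]),
    SketchNode(text="c" * 40),
])
print([len(c) for c in split_tree(doc, max_char=100)])  # [60, 60, 40]
```

With max_char=100 the root (160 characters) and the first section (120 characters) are both too large, so the sketch descends to the paragraphs, while the second section fits and is kept whole.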

__init__

__init__(max_char=5000)

Initialize the NodeSplitter.

Parameters:

Name      Type  Description                                 Default
max_char  int   The maximum number of characters per chunk  5000

split

split(node)

Split the node into chunks.

Parameters:

Name  Type  Description        Default
node  Node  The node to split  required

Returns:

Type         Description
list[Chunk]  A list of chunks

Usage

from datapizza.modules.splitters import NodeSplitter

splitter = NodeSplitter(
    max_char=800,
)

# document_node is assumed to be a Node, for example the output of a parser
node_chunks = splitter.split(document_node)

Features

  • Maintains Node object structure and hierarchy
  • Preserves metadata from original nodes
  • Respects node boundaries when possible
  • Supports both structure-preserving and flattened chunking
  • Handles nested node relationships intelligently

Examples

Basic Node Splitting

from datapizza.modules.parsers import TextParser
from datapizza.modules.splitters import NodeSplitter

# Parse text into nodes
parser = TextParser()
document = parser.parse("""
This is the first section of the document.
It contains important information about the topic.

This is the second section with more details.
It provides additional context and examples.

The final section concludes the document.
It summarizes the key points discussed.
""")

splitter = NodeSplitter(
    max_char=150,
)

chunks = splitter.split(document)

# Examine the structured chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:")
    print(f"  Content length: {len(chunk.text)}")
    print(f"  Content preview: {chunk.text[:80]}...")
    print("---")
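
After splitting, it can be useful to check whether any chunk still exceeds the configured limit, for example when a single leaf node is longer than max_char. The helper below is a small sketch, not part of the library: `oversized` is a hypothetical name, and it relies only on each chunk exposing a `text` attribute, as in the loop above. The demo uses stand-in objects so it runs without any parsed document.

```python
from types import SimpleNamespace


def oversized(chunks, max_char):
    """Return (index, length) pairs for chunks whose text exceeds max_char."""
    return [(i, len(c.text)) for i, c in enumerate(chunks) if len(c.text) > max_char]


# Stand-in objects with a .text attribute, used here in place of real chunks.
demo_chunks = [SimpleNamespace(text="a" * 120), SimpleNamespace(text="b" * 200)]
print(oversized(demo_chunks, max_char=150))  # [(1, 200)]
```

An empty result means every chunk fits within the limit.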