TextSplitter
datapizza.modules.splitters.TextSplitter
Bases: Splitter
A basic text splitter that operates directly on strings rather than Node objects. Unlike other splitters that work with Node types, this splitter takes raw text input and splits it into chunks while maintaining configurable size and overlap parameters.
__init__
Initialize the TextSplitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_char | int | The maximum number of characters per chunk | 5000 |
overlap | int | The number of characters to overlap between chunks | 0 |
split
Split the text into chunks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text to split | required |
Returns:
Type | Description |
---|---|
list[Chunk] | A list of chunks |
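To make the interaction between max_char and overlap concrete, here is a minimal sketch of a fixed-size, sliding-window split over a plain string. It is an illustration under stated assumptions, not the library's implementation: the function name split_by_chars is hypothetical, it returns plain strings rather than Chunk objects, and the real TextSplitter may choose its boundaries differently.

```python
def split_by_chars(text: str, max_char: int = 5000, overlap: int = 0) -> list[str]:
    """Illustrative sliding-window splitter (hypothetical helper, not datapizza's code)."""
    if max_char <= 0:
        raise ValueError("max_char must be positive")
    if not 0 <= overlap < max_char:
        raise ValueError("overlap must satisfy 0 <= overlap < max_char")

    # Each new window starts (max_char - overlap) characters after the previous
    # one, so consecutive windows share `overlap` characters.
    step = max_char - overlap
    pieces: list[str] = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_char])
        # Stop once the current window has reached the end of the text.
        if start + max_char >= len(text):
            break
        start += step
    return pieces


if __name__ == "__main__":
    for piece in split_by_chars("abcdefghij", max_char=4, overlap=1):
        print(repr(piece))
```

In this sketch the window advances by max_char - overlap characters per step, which is why the overlap has to be smaller than max_char for the loop to make progress.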
Usage
from datapizza.modules.splitters import TextSplitter
splitter = TextSplitter(
    max_char=500,
    overlap=50
)
chunks = splitter.split(text_content)
Features
- Simple, straightforward text splitting algorithm
- Configurable chunk size and overlap
- Lightweight implementation for basic splitting needs
- Preserves character-level accuracy in chunk boundaries
- Minimal overhead for high-performance applications
Examples
Basic Usage
from datapizza.modules.splitters import TextSplitter
splitter = TextSplitter(max_char=50, overlap=5)
text = """
This is a sample text that we want to split into smaller chunks.
The TextSplitter will divide this content based on the specified
chunk size and overlap parameters. This ensures that information
is preserved while creating manageable pieces of content.
"""
chunks = splitter.split(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {len(chunk.text)} chars")
    print(f"Content: {chunk.text}")
    print("---")
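When inspecting output like the above, it can help to measure how many characters two consecutive chunks actually share. The helper below is a small sketch for that purpose; shared_overlap is a hypothetical name, not part of datapizza, and it works on plain strings such as the chunk.text values printed above.

```python
def shared_overlap(left: str, right: str) -> int:
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    limit = min(len(left), len(right))
    # Try the longest candidate first and shrink until a match is found.
    for size in range(limit, 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


if __name__ == "__main__":
    # Plain strings stand in for the .text of two consecutive chunks.
    print(shared_overlap("This is a sam", "a sample text"))
```

Applied to neighbouring chunks, for example shared_overlap(chunks[i].text, chunks[i + 1].text), the result can be compared against the configured overlap. Whether the two match exactly depends on how the splitter picks its boundaries, which this page does not specify.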