Parsers

Parsers are pipeline components that convert documents into structured hierarchical Node representations. They extract text, layout information, and metadata from various document formats to create tree-like data structures for further processing.

Each parser should return a Node object, which is a hierarchical representation of the document content.

If you write a custom parser that returns a different type of object (for example, the plain text of the document content), you must use a TreeBuilder to convert it into a Node.
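To make the idea of a hierarchical representation concrete, here is a minimal, self-contained sketch of turning plain text into a tree. `SketchNode` and `build_tree` are hypothetical names invented for illustration. They are not the library's `Node` or `TreeBuilder`, whose actual interfaces may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchNode:
    """Hypothetical stand-in for a hierarchical node (not the library's Node)."""
    kind: str                      # e.g. "document", "paragraph", "sentence"
    text: str = ""                 # text is stored on leaf nodes only
    children: List["SketchNode"] = field(default_factory=list)

    @property
    def content(self) -> str:
        """Text of a leaf, or the joined text of all descendants."""
        if not self.children:
            return self.text
        return " ".join(child.content for child in self.children)


def build_tree(plain_text: str) -> SketchNode:
    """Build a document > paragraph > sentence tree from plain text."""
    document = SketchNode(kind="document")
    # Paragraphs are assumed to be separated by blank lines.
    for block in plain_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        paragraph = SketchNode(kind="paragraph")
        # Naive sentence split on ". "; a real parser would be more careful.
        for sentence in block.split(". "):
            sentence = sentence.strip()
            if sentence:
                paragraph.children.append(SketchNode(kind="sentence", text=sentence))
        document.children.append(paragraph)
    return document


tree = build_tree("First sentence. Second sentence.\n\nAnother paragraph.")
for paragraph in tree.children:
    print(paragraph.kind, len(paragraph.children))
```

The point of the tree shape is that later pipeline stages can work at whichever level they need (whole document, paragraph, or sentence) without re-parsing the raw text.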

Available Parsers

Core Parsers (Included by Default)

TextParser - Simple text parser for plain text content

Optional Parsers (Separate Installation Required)

AzureParser - Azure AI Document Intelligence parser for PDFs and documents
DoclingParser - Docling-based parser for PDFs with layout preservation and media extraction
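Because the optional parsers are installed separately, code that may run without them can guard the import. The sketch below uses only the standard library. The helper name `load_optional_class` is hypothetical, and the module path passed to it is an assumption that should be checked against the documentation of the parser package you install.

```python
import importlib
from typing import Optional


def load_optional_class(module_path: str, class_name: str) -> Optional[type]:
    """Return the named class, or None if its module cannot be imported."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        # Covers ModuleNotFoundError when the optional package is absent.
        return None
    return getattr(module, class_name, None)


# Assumed module path, for illustration only.
docling_parser_cls = load_optional_class(
    "datapizza.modules.parsers.docling", "DoclingParser"
)
if docling_parser_cls is None:
    print("DoclingParser is not available; install its package to use it.")
```

Returning `None` instead of raising lets the caller decide whether to fall back to the built-in TextParser or to stop with a clear installation message.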

Common Usage Patterns

Basic Text Processing

from datapizza.modules.parsers.text_parser import parse_text

# Process plain text
document = parse_text("Your text content here")

Document Processing Pipeline

from datapizza.modules.parsers import TextParser
from datapizza.modules.splitters import RecursiveSplitter

# Create processing pipeline
parser = TextParser()
splitter = RecursiveSplitter(chunk_size=1000)

# Process document (text_content is any plain-text string you want to parse)
text_content = "Your text content here"
document = parser.parse(text_content)
chunks = splitter(document.content)