Parsers
Parsers are pipeline components that convert documents into structured hierarchical Node representations. They extract text, layout information, and metadata from various document formats to create tree-like data structures for further processing.
Each parser should return a Node object, which is a hierarchical representation of the document content.
If you write a custom parser that returns a different type of object (for example, the plain text of the document content), you must use a TreeBuilder to convert it into a Node.
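The conversion step described above can be illustrated with a minimal, self-contained sketch. It does not use the library's real Node or TreeBuilder classes: `SketchNode`, `plain_text_parser`, and `build_tree` are hypothetical names invented for this example, and the paragraph/sentence splitting rule is an assumption chosen only to show the shape of the idea (plain text in, tree out).

```python
import re
from dataclasses import dataclass, field


@dataclass
class SketchNode:
    """Hypothetical stand-in for a hierarchical node; not the library's Node."""
    node_type: str
    content: str = ""
    children: list["SketchNode"] = field(default_factory=list)


def plain_text_parser(raw: str) -> str:
    """Hypothetical custom parser that returns plain text rather than a node."""
    return raw.strip()


def build_tree(text: str) -> SketchNode:
    """Hypothetical tree-building step: document -> paragraphs -> sentences."""
    document = SketchNode(node_type="document")
    # Treat blank lines as paragraph separators.
    for block in re.split(r"\n\s*\n", text):
        block = block.strip()
        if not block:
            continue
        paragraph = SketchNode(node_type="paragraph")
        # Split after sentence-ending punctuation that is followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", block):
            paragraph.children.append(
                SketchNode(node_type="sentence", content=sentence)
            )
        document.children.append(paragraph)
    return document


# The parser output is plain text, so it goes through the tree-building step.
root = build_tree(plain_text_parser("First sentence. Second one.\n\nNew paragraph."))
```

In a real pipeline the library's own TreeBuilder would take the place of `build_tree`; its actual interface may differ from this sketch.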
Available Parsers
Core Parsers (Included by Default)
- TextParser - Simple parser for plain-text content
Optional Parsers (Separate Installation Required)
- AzureParser - Azure AI Document Intelligence parser for PDFs and other documents
- DoclingParser - Docling-based parser for PDFs with layout preservation and media extraction
Common Usage Patterns
Basic Text Processing
from datapizza.modules.parsers.text_parser import parse_text
# Parse plain text into a hierarchical Node
document = parse_text("Your text content here")