AzureParser
A document parser that uses Azure AI Document Intelligence to extract structured content from PDFs and other documents.
Installation
datapizza.modules.parsers.azure.AzureParser
Bases: Parser
Parser that creates a hierarchical tree structure from Azure AI Document Intelligence response. The hierarchy goes from document -> pages -> paragraphs/tables -> lines/cells -> words.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
api_key | str | Azure AI Document Intelligence API key | required |
endpoint | str | Azure service endpoint URL | required |
result_type | str | Output format: "text" or "markdown" | 'text' |
parse
Parse a Document with Azure AI Document Intelligence into a Node structure.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | Path to the document | required |
Returns:
Type | Description |
---|---|
Node | A Node representing the document with hierarchical structure |
parse_with_azure_ai
Parse a Document with Azure AI Document Intelligence and return the raw response as a JSON-style dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | Path to the document | required |
Returns:
Type | Description |
---|---|
dict | A dictionary with the Azure AI Document Intelligence response |
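Because parse_with_azure_ai returns a plain dictionary, the raw response can be stored for later inspection or debugging. The following is a minimal sketch, not part of the library: save_raw_response is a hypothetical helper, and the only parser call it relies on is parse_with_azure_ai(file_path) as documented above.

```python
import json
from pathlib import Path
from typing import Any


def save_raw_response(parser: Any, file_path: str, out_path: str) -> dict:
    """Hypothetical helper: dump the raw Azure response to a JSON file.

    `parser` is any object exposing parse_with_azure_ai(file_path) -> dict,
    for example an AzureParser instance.
    """
    raw = parser.parse_with_azure_ai(file_path)
    # default=str is a defensive assumption: it stringifies any value
    # that is not natively JSON-serializable instead of raising.
    text = json.dumps(raw, indent=2, default=str)
    Path(out_path).write_text(text, encoding="utf-8")
    return raw
```

With a configured parser, a call such as save_raw_response(parser, "document.pdf", "document.raw.json") would write the response next to the source file.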
Usage
from datapizza.modules.parsers.azure import AzureParser

parser = AzureParser(
    api_key="your-azure-key",
    endpoint="https://your-endpoint.cognitiveservices.azure.com/",
    result_type="text",
)

document_node = parser.parse("document.pdf")
Parameters
- api_key (str): Azure AI Document Intelligence API key
- endpoint (str): Azure service endpoint URL
- result_type (str): Output format, "text" or "markdown" (default: "text")
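Since api_key and endpoint are both required, it can help to check the configuration before constructing the parser. The sketch below is an illustration under stated assumptions: azure_parser_kwargs is a hypothetical helper, the environment variable names are the ones used in the example under Examples, and the result_type check is done by the helper itself (whether AzureParser validates this value is not stated here).

```python
import os
from typing import Mapping

KEY_VAR = "AZURE_DOC_INTELLIGENCE_KEY"
ENDPOINT_VAR = "AZURE_DOC_INTELLIGENCE_ENDPOINT"
ALLOWED_RESULT_TYPES = ("text", "markdown")


def azure_parser_kwargs(
    result_type: str = "text",
    env: Mapping[str, str] = os.environ,
) -> dict:
    """Hypothetical helper: build keyword arguments for AzureParser."""
    if result_type not in ALLOWED_RESULT_TYPES:
        raise ValueError(
            f"result_type must be one of {ALLOWED_RESULT_TYPES}, got {result_type!r}"
        )
    # Fail early with a clear message instead of passing None downstream.
    missing = [name for name in (KEY_VAR, ENDPOINT_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "api_key": env[KEY_VAR],
        "endpoint": env[ENDPOINT_VAR],
        "result_type": result_type,
    }
```

The returned dictionary can then be unpacked into the constructor, as in AzureParser(**azure_parser_kwargs("markdown")).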
Features
- Creates hierarchical document structure: document → sections → paragraphs/tables/figures
- Extracts bounding regions and spatial layout information
- Handles tables, figures, and complex document layouts
- Preserves metadata including page numbers and coordinates
- Supports both sync and async processing
- Converts media elements to base64 images with coordinates
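The feature list mentions async processing, but this page does not name an async method. As a stand-in, the sketch below runs the documented synchronous parse call in worker threads with asyncio.to_thread (Python 3.9+). parse_many and max_concurrency are hypothetical names, and the sketch assumes that a single parser instance can safely be shared across threads.

```python
import asyncio
from typing import Any, List


async def parse_many(parser: Any, paths: List[str], max_concurrency: int = 4) -> list:
    """Hypothetical helper: parse several files concurrently.

    Uses only the synchronous parser.parse(file_path) call and offloads
    it to threads; this is not the library's own async interface.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def parse_one(path: str) -> Any:
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            return await asyncio.to_thread(parser.parse, path)

    # gather returns results in the same order as `paths`.
    return await asyncio.gather(*(parse_one(p) for p in paths))
```

From synchronous code this would be driven with asyncio.run(parse_many(parser, ["a.pdf", "b.pdf"])).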
Node Types Created
- DOCUMENT: Root document container
- SECTION: Document sections
- PARAGRAPH: Text paragraphs with content
- TABLE: Tables with markdown representation
- FIGURE: Images and figures with media data
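To pull out all nodes of one type, for example every TABLE, the tree can be walked recursively. This is a rough sketch under assumptions: the children attribute appears in the example on this page, but the attribute holding the node type is assumed here to be called node_type (either a string or an enum member), which this page does not confirm. iter_nodes, type_name, and nodes_of_type are hypothetical helper names.

```python
from typing import Any, Iterator, List


def iter_nodes(node: Any) -> Iterator[Any]:
    """Yield the node and all of its descendants, depth-first, parents first."""
    yield node
    for child in getattr(node, "children", None) or []:
        yield from iter_nodes(child)


def type_name(node: Any) -> str:
    """Return the node's type as an upper-case string.

    Assumption: the type lives in `node_type`. Enum members are reduced
    to their .name; plain strings are used as they are.
    """
    node_type = getattr(node, "node_type", None)
    return str(getattr(node_type, "name", node_type)).upper()


def nodes_of_type(root: Any, wanted: str) -> List[Any]:
    """Collect every node under `root` whose type matches `wanted`."""
    return [n for n in iter_nodes(root) if type_name(n) == wanted.upper()]
```

For a parsed document, nodes_of_type(document, "TABLE") would return the table nodes in document order, provided the node_type assumption holds.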
Examples
Basic Document Processing
import os

from datapizza.modules.parsers.azure import AzureParser

parser = AzureParser(
    api_key=os.getenv("AZURE_DOC_INTELLIGENCE_KEY"),
    endpoint=os.getenv("AZURE_DOC_INTELLIGENCE_ENDPOINT"),
    result_type="markdown",
)

# Parse document
document = parser.parse("complex_document.pdf")

# Access hierarchical structure
for section in document.children:
    for paragraph in section.children:
        print(f"Content: {paragraph.content}")
        print(f"Bounding regions: {paragraph.metadata.get('boundingRegions', [])}")
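The bounding regions can also be used to group content by page. The sketch below assumes that each entry in boundingRegions is a dictionary with a pageNumber key, as in the Azure AI Document Intelligence response format; whether the parser stores the entries unchanged is an assumption. pages_for and content_by_page are hypothetical helpers that follow the same document → section → paragraph loop as the example above.

```python
from typing import Any, Dict, List


def pages_for(node: Any) -> List[int]:
    """Return the sorted page numbers a node appears on.

    Assumption: metadata["boundingRegions"] is a list of dicts that each
    carry a "pageNumber" key. Entries without it are skipped.
    """
    metadata = getattr(node, "metadata", None) or {}
    regions = metadata.get("boundingRegions", []) or []
    return sorted(
        {r["pageNumber"] for r in regions if isinstance(r, dict) and "pageNumber" in r}
    )


def content_by_page(document: Any) -> Dict[int, List[str]]:
    """Group paragraph content by page number."""
    grouped: Dict[int, List[str]] = {}
    for section in document.children:
        for paragraph in section.children:
            # A paragraph spanning two pages is listed under both.
            for page in pages_for(paragraph):
                grouped.setdefault(page, []).append(paragraph.content)
    return grouped
```

Paragraphs without bounding regions are left out of the result rather than assigned to a guessed page.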