AzureParser

A document parser that uses Azure AI Document Intelligence to extract structured content from PDFs and other documents.

Installation

pip install datapizza-ai-parsers-azure

datapizza.modules.parsers.azure.AzureParser

Bases: Parser

Parser that builds a hierarchical tree structure from the Azure AI Document Intelligence response. The hierarchy goes from document -> pages -> paragraphs/tables -> lines/cells -> words.

Parameters:

Name	Type	Description	Default
`api_key`	`str`	Azure AI Document Intelligence API key	required
`endpoint`	`str`	Azure service endpoint URL	required
`result_type`	`str`	Output format: `"text"` or `"markdown"`	`'text'`

__call__

__call__(file_path, metadata=None)

Allow the parser to be called directly as a function.

Parameters:

Name	Type	Description	Default
`file_path`	`str`	Path to the document	required
`metadata`	`dict \| None`	Optional metadata to be merged into the root document node	`None`

Returns:

Type	Description
`Node`	A Node representing the document with hierarchical structure

a_parse `async`

a_parse(file_path, metadata=None)

Async version of parse().

Parameters:

Name	Type	Description	Default
`file_path`	`str`	Path to the document	required
`metadata`	`dict \| None`	Optional metadata to be merged into the root document node	`None`

Returns:

Type	Description
`Node`	A Node representing the document with hierarchical structure

Raises:

Type	Description
`TypeError`	If metadata is not a dict or None

parse

parse(file_path, metadata=None)

Parse a document with Azure AI Document Intelligence into a Node structure.

Parameters:

Name	Type	Description	Default
`file_path`	`str`	Path to the document	required
`metadata`	`dict \| None`	Optional metadata to be merged into the root document node	`None`

Returns:

Type	Description
`Node`	A Node representing the document with hierarchical structure

Raises:

Type	Description
`TypeError`	If metadata is not a dict or None
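
The `TypeError` contract above can be pictured with a small sketch. `merge_root_metadata` is a hypothetical helper, not part of datapizza; it only illustrates validating `metadata` and merging it into a root node's metadata. That caller-supplied keys take precedence is an assumption, since the reference does not state the merge order.

```python
from typing import Any, Optional


def merge_root_metadata(
    root_metadata: dict[str, Any], metadata: Optional[dict[str, Any]] = None
) -> dict[str, Any]:
    """Hypothetical helper: validate `metadata` and merge it into root metadata."""
    if metadata is not None and not isinstance(metadata, dict):
        # Mirrors the documented contract: only a dict or None is accepted.
        raise TypeError("metadata must be a dict or None")
    merged = dict(root_metadata)  # copy so the input is not mutated
    merged.update(metadata or {})  # assumption: caller-supplied keys win
    return merged


merged = merge_root_metadata({"source": "azure"}, {"customer": "acme"})
print(merged)
```

Validating before any remote call is made keeps a malformed argument from costing a request to the service.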

parse_with_azure_ai

parse_with_azure_ai(file_path)

Parse a document with Azure AI Document Intelligence into a JSON dictionary.

Parameters:

Name	Type	Description	Default
`file_path`	`str`	Path to the document	required

Returns:

Type	Description
`dict`	A dictionary with the Azure AI Document Intelligence response
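
A short sketch of working with such a dictionary follows. The sample is hand-written, and the keys (`paragraphs`, `boundingRegions`, `pageNumber`, `content`) are assumed from the Azure layout response format. The dictionary returned by `parse_with_azure_ai` may be shaped differently, so inspect a real response before relying on them.

```python
from collections import defaultdict

# Hand-written sample; a real response is much larger and its shape is assumed.
sample_response = {
    "paragraphs": [
        {"content": "Title", "boundingRegions": [{"pageNumber": 1}]},
        {"content": "Intro text", "boundingRegions": [{"pageNumber": 1}]},
        {"content": "Appendix", "boundingRegions": [{"pageNumber": 2}]},
    ]
}


def paragraphs_by_page(response: dict) -> dict:
    """Group paragraph text by the page of its first bounding region."""
    pages = defaultdict(list)
    for paragraph in response.get("paragraphs", []):
        regions = paragraph.get("boundingRegions", [])
        if not regions:
            continue  # skip paragraphs without layout information
        pages[regions[0]["pageNumber"]].append(paragraph.get("content", ""))
    return dict(pages)


grouped = paragraphs_by_page(sample_response)
print(grouped)
```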

Usage

from datapizza.modules.parsers.azure import AzureParser

parser = AzureParser(
    api_key="your-azure-key",
    endpoint="https://your-endpoint.cognitiveservices.azure.com/",
    result_type="text"
)

document_node = parser.parse("document.pdf")

Parameters

api_key (str): Azure AI Document Intelligence API key
endpoint (str): Azure service endpoint URL
result_type (str): Output format - "text" or "markdown" (default: "text")

Features

Creates hierarchical document structure: document → sections → paragraphs/tables/figures
Extracts bounding regions and spatial layout information
Handles tables, figures, and complex document layouts
Preserves metadata including page numbers and coordinates
Supports both sync and async processing
Converts media elements to base64 images with coordinates

Node Types Created

DOCUMENT: Root document container
SECTION: Document sections
PARAGRAPH: Text paragraphs with content
TABLE: Tables with markdown representation
FIGURE: Images and figures with media data
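
To see how these node types nest, the sketch below walks a small tree and counts nodes per type. `SketchNode` is a stand-in written for this example, not the library's `Node` class; its attribute names (`node_type`, `content`, `children`) are assumptions, so adapt them to the real object.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SketchNode:
    """Stand-in for the library's Node; attribute names are assumptions."""
    node_type: str
    content: str = ""
    children: list["SketchNode"] = field(default_factory=list)


def count_node_types(node: SketchNode) -> Counter:
    """Walk the tree depth-first and count nodes per type."""
    counts = Counter([node.node_type])
    for child in node.children:
        counts.update(count_node_types(child))
    return counts


tree = SketchNode("DOCUMENT", children=[
    SketchNode("SECTION", children=[
        SketchNode("PARAGRAPH", "Hello"),
        SketchNode("TABLE", "| a | b |"),
    ]),
    SketchNode("SECTION", children=[SketchNode("FIGURE")]),
])
counts = count_node_types(tree)
print(counts)
```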

Examples

Basic Document Processing

from datapizza.modules.parsers.azure import AzureParser
import os

parser = AzureParser(
    api_key=os.getenv("AZURE_DOC_INTELLIGENCE_KEY"),
    endpoint=os.getenv("AZURE_DOC_INTELLIGENCE_ENDPOINT"),
    result_type="markdown"
)

# Parse document
document = parser.parse("complex_document.pdf")

# Access hierarchical structure
for section in document.children:
    for paragraph in section.children:
        print(f"Content: {paragraph.content}")
        print(f"Bounding regions: {paragraph.metadata.get('boundingRegions', [])}")
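
`os.getenv` returns `None` when a variable is unset, so a missing variable would pass `None` to the constructor. A minimal sketch of failing fast instead is shown below; `require_env` is a hypothetical helper, not part of datapizza.

```python
import os


def require_env(name: str, environ=None) -> str:
    """Return the variable's value or raise if it is unset or empty."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Demonstration with an explicit mapping instead of the real environment.
fake_env = {"AZURE_DOC_INTELLIGENCE_KEY": "example-key"}
api_key = require_env("AZURE_DOC_INTELLIGENCE_KEY", fake_env)
print(api_key)
```

In real use, call `require_env` without the second argument so it reads from `os.environ`, and pass the results as `api_key` and `endpoint`.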

Async Processing

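import asyncio

async def process_document():
    # a_parse is the async counterpart of parse(); `parser` is the instance created above
    document = await parser.a_parse("document.pdf")
    return document

# Run from synchronous code; inside an async context, use `await process_document()`
document = asyncio.run(process_document())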
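
When several files need parsing, the async method can be combined with `asyncio.gather`. The sketch below uses `StubParser`, a stand-in defined here so the code runs without Azure credentials; `parse_many` and its `limit` parameter are likewise hypothetical. Whether the real service tolerates a given level of concurrency is not covered by this reference, so treat the limit as something to tune.

```python
import asyncio


class StubParser:
    """Stand-in for AzureParser so the sketch runs without Azure credentials."""

    async def a_parse(self, file_path, metadata=None):
        await asyncio.sleep(0)  # simulate awaiting the remote service
        return {"file": file_path, "metadata": metadata or {}}


async def parse_many(parser, paths, limit=2):
    """Parse several files concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def parse_one(path):
        async with semaphore:
            return await parser.a_parse(path)

    # gather preserves the order of `paths` in its result list
    return await asyncio.gather(*(parse_one(p) for p in paths))


results = asyncio.run(parse_many(StubParser(), ["a.pdf", "b.pdf", "c.pdf"]))
print([r["file"] for r in results])
```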
