PDFImageSplitter

datapizza.modules.splitters.PDFImageSplitter

Bases: Splitter

Splits a PDF document into individual pages, renders each page to an image file using fitz (the import name of PyMuPDF), and returns one Chunk object per page carrying that page's metadata.

__init__

__init__(
    image_format="png",
    output_base_dir="output_images",
    dpi=300,
)

Initializes the Splitter.

Parameters:

image_format (Literal['png', 'jpeg'], default 'png'): The format to save the images in ('png' or 'jpeg').

output_base_dir (str | Path, default 'output_images'): The base directory where images for processed PDFs will be saved. A subdirectory will be created for each PDF.

dpi (int, default 300): Dots Per Inch for rendering the PDF page to an image. Higher values increase resolution and file size.
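
To make the effect of dpi concrete, the sketch below estimates the pixel dimensions of a rendered page. It relies only on the fact that PDF page sizes are measured in points (1 point = 1/72 inch); the helper names rendered_size and raw_rgb_bytes are hypothetical and are not part of datapizza, and the actual renderer may round slightly differently.

```python
# PDF page sizes are expressed in points; 1 point = 1/72 inch.
POINTS_PER_INCH = 72


def rendered_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Approximate pixel size of a page rendered at the given DPI (hypothetical helper)."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def raw_rgb_bytes(width_px: int, height_px: int) -> int:
    """Uncompressed size of an 8-bit RGB image with these dimensions."""
    return width_px * height_px * 3


# US Letter is 612 x 792 points (8.5 x 11 inches).
letter_300 = rendered_size(612, 792, dpi=300)  # (2550, 3300)
letter_150 = rendered_size(612, 792, dpi=150)  # (1275, 1650)

print(letter_300, raw_rgb_bytes(*letter_300))
print(letter_150, raw_rgb_bytes(*letter_150))
```

Halving the DPI quarters the pixel count, so a lower value can be a reasonable choice when the images only need to be legible rather than print quality.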

split

split(pdf_path)

Processes the PDF using fitz (PyMuPDF): renders each page to an image and returns one Chunk object per page.

Parameters:

pdf_path (str | Path, required): The path to the input PDF file.

Returns:

list[Chunk]: A list of Chunk objects, one for each page of the PDF.
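
Because split also writes one image per page under output_base_dir, it can be handy to locate those files after the call. The following is a minimal sketch using only pathlib. It assumes the per-PDF subdirectory is named after the PDF's file stem, which this page does not confirm; expected_image_dir and list_page_images are hypothetical helpers, not datapizza functions.

```python
from pathlib import Path


def expected_image_dir(pdf_path, output_base_dir="output_images") -> Path:
    """Guess the per-PDF image directory.

    ASSUMPTION: the subdirectory is named after the PDF's stem
    (e.g. report.pdf -> output_images/report). Verify against your install.
    """
    return Path(output_base_dir) / Path(pdf_path).stem


def list_page_images(image_dir, image_format="png") -> list[Path]:
    """Return the image files of the given format in image_dir, sorted by name."""
    return sorted(Path(image_dir).glob(f"*.{image_format}"))


image_dir = expected_image_dir("docs/report.pdf")
print(image_dir)

# An existence check keeps the sketch safe to run before any PDF is processed.
if image_dir.is_dir():
    for image_path in list_page_images(image_dir):
        print(image_path.name)
```

The sort is lexicographic, so file names without zero-padded page numbers would not list in page order; comparing the number of files found with len() of the returned chunk list is a quick sanity check either way.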

Usage

from datapizza.modules.splitters import PDFImageSplitter

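# Use the defaults: PNG images at 300 DPI under 'output_images'
splitter = PDFImageSplitter()

# Replace "pdf_path" with the path to your PDF; one Chunk is returned per page
pdf_chunks = splitter("pdf_path")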

Features

  • Specialized handling of PDF document structure
  • Preserves image data and visual elements
  • Maintains spatial layout information
  • Includes page-level metadata and coordinates
  • Handles complex document layouts with mixed content
  • Optimized for PDF content from document intelligence services

Examples

Basic PDF Content Splitting

from datapizza.modules.splitters import PDFImageSplitter

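# Render each page of the PDF to an image using the default settings
pdf_splitter = PDFImageSplitter()

# Replace "pdf_path" with the path to your PDF
pdf_chunks = pdf_splitter("pdf_path")

# Inspect each page-level chunk; media and bounding regions are optional,
# so they are checked before being read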
for i, chunk in enumerate(pdf_chunks):
    print(f"Chunk {i+1}:")
    print(f"  Content length: {len(chunk.content)}")
    print(f"  Page: {chunk.metadata.get('page_number', 'unknown')}")

    if hasattr(chunk, 'media') and chunk.media:
        print(f"  Media elements: {len(chunk.media)}")
        for media in chunk.media:
            print(f"    Type: {media.media_type}")

    if 'boundingRegions' in chunk.metadata:
        print(f"  Bounding regions: {len(chunk.metadata['boundingRegions'])}")

    print("---")