PDFImageSplitter
datapizza.modules.splitters.PDFImageSplitter
Bases: Splitter
Splits a PDF document into individual pages, saves each page as an image using fitz, and returns metadata about each page as a Chunk object.
__init__
Initializes the Splitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
image_format | Literal['png', 'jpeg'] | The format to save the images in ('png' or 'jpeg'). Defaults to 'png'. | 'png' |
output_base_dir | str \| Path | The base directory where images for processed PDFs will be saved. A subdirectory will be created for each PDF. Defaults to 'output_images'. | 'output_images' |
dpi | int | Dots Per Inch for rendering the PDF page to an image. Higher values increase resolution and file size. Defaults to 300. | 300 |
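The effect of `dpi` on image size can be estimated in advance. PDF page dimensions are expressed in points (1/72 inch), so the rendered width and height scale linearly with `dpi`, and the pixel count grows with its square. The following is a minimal sketch of that arithmetic only; `rendered_size` is a hypothetical helper written for illustration, not part of datapizza, and the actual renderer may round slightly differently.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Approximate pixel size of a page rendered at the given DPI.

    Hypothetical helper for illustration; not part of datapizza.
    """
    scale = dpi / 72  # PDF points are 1/72 inch
    return round(width_pt * scale), round(height_pt * scale)


# A US Letter page measures 612 x 792 points.
letter_300 = rendered_size(612, 792, dpi=300)
letter_150 = rendered_size(612, 792, dpi=150)
print(letter_300)  # (2550, 3300)
print(letter_150)  # (1275, 1650)
```

Halving `dpi` from 300 to 150 halves each dimension and therefore cuts the pixel count to a quarter, which is usually the first thing to try when output files are too large.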
split
Processes the PDF using fitz: converts pages to images and returns Chunk objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pdf_path | str \| Path | The path to the input PDF file. | required |
Returns:
Type | Description |
---|---|
list[Chunk] | A list of Chunk objects, one for each page of the PDF. |
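Because `split` writes one image per page under a per-PDF subdirectory of `output_base_dir`, it can be handy to locate those files after a run. The sketch below uses only the standard library and rests on an assumption this page does not confirm: that the subdirectory is named after the PDF's file stem. Both `expected_output_dir` and `list_page_images` are hypothetical helpers, so check the names against what the splitter actually produces.

```python
from pathlib import Path


def expected_output_dir(pdf_path, output_base_dir="output_images") -> Path:
    """Guess the per-PDF image directory.

    ASSUMPTION: the subdirectory is named after the PDF file stem.
    This naming is not confirmed by the documentation.
    """
    return Path(output_base_dir) / Path(pdf_path).stem


def list_page_images(pdf_path, output_base_dir="output_images", image_format="png"):
    """Return image files found in the guessed directory, sorted by name."""
    out_dir = expected_output_dir(pdf_path, output_base_dir)
    if not out_dir.is_dir():
        return []  # nothing rendered yet, or the naming assumption is wrong
    # Lexicographic sort: "page_10" sorts before "page_2" unless names are zero-padded.
    return sorted(out_dir.glob(f"*.{image_format}"))


print(expected_output_dir("reports/q1.pdf"))
```

An empty result from `list_page_images` after a successful `split` call most likely means the directory naming differs from the assumption above.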
Usage
from datapizza.modules.splitters import PDFImageSplitter
splitter = PDFImageSplitter()
pdf_chunks = splitter("pdf_path")
Features
- Specialized handling of PDF document structure
- Preserves image data and visual elements
- Maintains spatial layout information
- Includes page-level metadata and coordinates
- Handles complex document layouts with mixed content
- Optimized for PDF content from document intelligence services
Examples
Basic PDF Content Splitting
from datapizza.modules.splitters import PDFImageSplitter
# Split while preserving images and layout
pdf_splitter = PDFImageSplitter()
pdf_chunks = pdf_splitter("pdf_path")
# Examine chunks with visual content
for i, chunk in enumerate(pdf_chunks):
    print(f"Chunk {i+1}:")
    print(f" Content length: {len(chunk.content)}")
    print(f" Page: {chunk.metadata.get('page_number', 'unknown')}")
    if hasattr(chunk, 'media') and chunk.media:
        print(f" Media elements: {len(chunk.media)}")
        for media in chunk.media:
            print(f" Type: {media.media_type}")
    if 'boundingRegions' in chunk.metadata:
        print(f" Bounding regions: {len(chunk.metadata['boundingRegions'])}")
    print("---")