Skip to content

Multimodality

The clients supports various media types including images and PDFs, allowing you to create rich multimodal applications.

Supported Media Types

Media Type Supported Formats Source Types
Images PNG, JPEG, GIF, WebP File path, URL, base64
PDFs PDF documents File path, base64

Basic Image Input

Single Image from File

from datapizza.clients.openai import OpenAIClient
from datapizza.type import Media, MediaBlock, TextBlock

client = OpenAIClient(
    api_key="your-api-key",
    model="gpt-4o"  # Vision models required for images
)

# Create image media object
image = Media(
    media_type="image",
    source_type="path",
    source="image.png", # Use the correct path
    extension="png"
)

# Create media block
media_block = MediaBlock(media=image)
text_block = TextBlock(content="What do you see in this image?")

# Send multimodal input
response = client.invoke(
    input=[text_block, media_block],
    max_tokens=200
)

print(response.text)

Image from URL

# Image from URL
image_url = Media(
    media_type="image",
    source_type="url",
    source="https://example.com/image.png",
    extension="png"
)

response = client.invoke(
    input=[
        TextBlock(content="Describe this image"),
        MediaBlock(media=image_url)
    ]
)
print(response.text)

Image from Base64

import base64

# Read and encode image
with open("image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

image_b64 = Media(
    media_type="image",
    source_type="base64",
    source=base64_image,
    extension="png"
)

response = client.invoke(
    input=[
        TextBlock(content="Analyze this image"),
        MediaBlock(media=image_b64)
    ]
)
print(response.text)

Multiple Images

Compare or analyze multiple images in a single request:

# Multiple images for comparison
image1 = Media(
    media_type="image",
    source_type="path",
    source="before.png",
    extension="png"
)

image2 = Media(
    media_type="image",
    source_type="path",
    source="after.png",
    extension="png"
)

response = client.invoke(
    input=[
        TextBlock(content="Compare these two images and describe the differences"),
        MediaBlock(media=image1),
        MediaBlock(media=image2)
    ],
    max_tokens=300
)

print(response.text)

Working with PDFs

# PDF from file path
pdf_doc = Media(
    media_type="pdf",
    source_type="path",
    source="document.pdf",
    extension="pdf"
)

response = client.invoke(
    input=[
        TextBlock(content="Summarize the key points from this document"),
        MediaBlock(media=pdf_doc)
    ],
    max_tokens=500
)

print(response.text)

Working with Audio

Google handle audio inline

pip install datapizza-ai-clients-google
from datapizza.clients.google import GoogleClient
from datapizza.type import Media, MediaBlock, TextBlock

client = GoogleClient(
    api_key="YOUR_API_KEY",
    model="gemini-2.0-flash-exp"
)
# PDF from file path
media = Media(
    media_type="audio",
    source_type="path",
    source="sample.mp3",
    extension="mp3"
)

response = client.invoke(
    input=[
        TextBlock(content="Summarize the key points from this audio file"),
        MediaBlock(media=media)
    ],
)

print(response.text)