
Captioners

Captioners are pipeline components that generate captions and descriptions for media content such as images, figures, and tables. They use LLM clients to analyze visual content and produce descriptive text.

LLMCaptioner

datapizza.modules.captioners.LLMCaptioner

Bases: NodeCaptioner

Captioner that uses an LLM client to caption a node.

__init__

__init__(
    client,
    max_workers=3,
    system_prompt_table="Generate concise captions for tables.",
    system_prompt_figure="Generate descriptive captions for figures.",
)

Captioner that uses an LLM client to caption a node.

Args:

  • client: The LLM client to use.
  • max_workers: The maximum number of workers to use. In sync mode this is the number of threads spawned; in async mode it is the number of batches.
  • system_prompt_table: The system prompt to use for table captioning.
  • system_prompt_figure: The system prompt to use for figure captioning.
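The two meanings of max_workers can be illustrated without the library. The following is a minimal sketch, assuming that sync mode maps items over a thread pool and that async mode splits the items into max_workers batches that run concurrently. The functions fake_caption, fake_caption_async, caption_sync, and caption_async are hypothetical stand-ins, not datapizza APIs, and the library's actual scheduling may differ.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def fake_caption(item: str) -> str:
    """Hypothetical stand-in for a blocking LLM call."""
    return f"caption for {item}"


async def fake_caption_async(item: str) -> str:
    """Hypothetical stand-in for an awaitable LLM call."""
    await asyncio.sleep(0)
    return f"caption for {item}"


def caption_sync(items: list[str], max_workers: int = 3) -> list[str]:
    # Sync mode: max_workers is the number of threads in the pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_caption, items))


async def caption_async(items: list[str], max_workers: int = 3) -> list[str]:
    # Async mode (assumed): split the items into max_workers batches.
    batches = [items[i::max_workers] for i in range(max_workers)]

    async def run_batch(batch: list[str]) -> list[str]:
        # Items inside one batch are awaited one after another.
        return [await fake_caption_async(item) for item in batch]

    # The batches themselves run concurrently.
    results = await asyncio.gather(*(run_batch(b) for b in batches))

    # Re-interleave the batch results to restore the input order.
    ordered: list[str] = [""] * len(items)
    for i, batch_result in enumerate(results):
        ordered[i::max_workers] = batch_result
    return ordered
```

Under these assumptions, both paths return captions in input order, and max_workers bounds how many calls are in flight at once.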

a_caption async

a_caption(node)

Asynchronously caption a node.

Args:

  • node: The node to caption.

Returns:

  • Node: The same node with the caption.

a_caption_media async

a_caption_media(media, system_prompt=None)

Asynchronously caption an image.

Args:

  • media: The media to caption.
  • system_prompt: Optional system prompt to guide the captioning.

Returns:

  • str: The string caption.

caption

caption(node)

Caption a node.

Args:

  • node: The node to caption.

Returns:

  • Node: The same node with the caption.

caption_media

caption_media(media, system_prompt=None)

Caption an image.

Args:

  • media: The media to caption.
  • system_prompt: Optional system prompt to guide the captioning.

Returns:

  • str: The string caption.

A captioner that uses language models to generate captions for media nodes (figures and tables) within document hierarchies.

from datapizza.clients.openai import OpenAIClient
from datapizza.modules.captioners import LLMCaptioner
from datapizza.type import Media, MediaNode, NodeType

# Replace the placeholder with a real OpenAI API key.
client = OpenAIClient(api_key="OPENAI_API_KEY", model="gpt-4o")
captioner = LLMCaptioner(
    client=client,
    max_workers=3,
    system_prompt_table="Describe this table in detail.",
    system_prompt_figure="Describe this figure/image in detail.",
)

# A single figure node backed by a local PNG file.
document_node = MediaNode(
    node_type=NodeType.FIGURE,
    children=[],
    metadata={},
    media=Media(
        source_type="path",
        source="gogole.png",
        extension="png",
        media_type="image",
    ),
)

# Calling the captioner on the node returns the captioned result.
captioned_document = captioner(document_node)
print(captioned_document)

Parameters:

  • client (Client): The LLM client to use for caption generation
  • max_workers (int): Maximum number of concurrent workers for parallel processing (default: 3)
  • system_prompt_table (str, optional): System prompt for table captioning
  • system_prompt_figure (str, optional): System prompt for figure captioning

Features:

  • Automatically finds all media nodes (figures and tables) in a document hierarchy
  • Generates captions using configurable system prompts
  • Supports concurrent processing for better performance
  • Creates new paragraph nodes containing the original content plus generated captions
  • Preserves original node metadata and structure
  • Supports both sync and async processing

Supported Node Types:

  • FIGURE: Images and visual figures
  • TABLE: Tables and tabular data
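How media nodes might be located in a document hierarchy can be sketched with a plain tree walk. This is a minimal illustration only: the NodeType enum and Node dataclass below are simplified, hypothetical stand-ins for the library's types, and find_media_nodes is not a datapizza function.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    """Hypothetical, simplified stand-in for the library's node types."""
    SECTION = "section"
    PARAGRAPH = "paragraph"
    FIGURE = "figure"
    TABLE = "table"


@dataclass
class Node:
    """Hypothetical minimal tree node."""
    node_type: NodeType
    content: str = ""
    children: list["Node"] = field(default_factory=list)


# Only these node types are candidates for captioning.
CAPTIONABLE = {NodeType.FIGURE, NodeType.TABLE}


def find_media_nodes(root: Node) -> list[Node]:
    """Collect FIGURE and TABLE nodes in document (pre-order) order."""
    found: list[Node] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.node_type in CAPTIONABLE:
            found.append(node)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return found


doc = Node(NodeType.SECTION, children=[
    Node(NodeType.PARAGRAPH, "intro"),
    Node(NodeType.FIGURE, "fig1"),
    Node(NodeType.SECTION, children=[Node(NodeType.TABLE, "tab1")]),
])
media_contents = [n.content for n in find_media_nodes(doc)]
```

An explicit stack avoids recursion limits on deeply nested documents; a recursive walk would work equally well for small trees.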

Output Format:

The captioner creates new paragraph nodes with content in the format:

{original_content} <{node_type}> [{generated_caption}]
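The template above can be reproduced with a plain f-string. This is a small sketch: format_captioned_content is a hypothetical helper, not part of datapizza, and how the library renders node_type (for example, as an enum name or a lowercase value) is an assumption here.

```python
def format_captioned_content(
    original_content: str, node_type: str, generated_caption: str
) -> str:
    """Combine content, node type, and caption using the documented template."""
    # node_type is passed as a plain string; its exact rendering is assumed.
    return f"{original_content} <{node_type}> [{generated_caption}]"


example = format_captioned_content(
    "Figure 1", "figure", "A bar chart of monthly sales"
)
```

With these inputs, example is the string `Figure 1 <figure> [A bar chart of monthly sales]`.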