Metatagger

Metataggers are pipeline components that add metadata tags to content chunks using language models. They analyze text content and generate relevant keywords, tags, or other metadata to enhance content discoverability and organization.

datapizza.modules.metatagger.KeywordMetatagger

Bases: Metatagger

Keyword metatagger that uses an LLM client to add metadata to a chunk.

__init__

__init__(
    client,
    max_workers=3,
    system_prompt=None,
    user_prompt=None,
    keyword_name="keywords",
)

Parameters:

Name           Type          Description                              Default
client         Client        The LLM client to use.                   required
max_workers    int           The maximum number of workers to use.    3
system_prompt  str | None    The system prompt to use.                None
user_prompt    str | None    The user prompt to use.                  None
keyword_name   str           The name of the keyword field.           'keywords'

a_tag (async)

a_tag(chunks)

Asynchronous counterpart of tag: add metadata to a chunk or a list of chunks.
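Because a_tag is a coroutine, it has to be awaited from an event loop. The sketch below shows that calling pattern only. StubChunk and StubMetatagger are hypothetical stand-ins written for this example (they are not datapizza classes), and the "keywords" they produce come from a trivial word-length rule instead of an LLM call.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StubChunk:
    """Hypothetical stand-in for a chunk: an id, some text, a metadata dict."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


class StubMetatagger:
    """Hypothetical stand-in that exposes an a_tag(chunks) coroutine."""

    def __init__(self, keyword_name="keywords"):
        self.keyword_name = keyword_name

    async def a_tag(self, chunks):
        async def tag_one(chunk):
            await asyncio.sleep(0)  # placeholder for an awaited LLM request
            # Fake keywords: the three longest words, for illustration only.
            words = sorted(chunk.text.rstrip(".").split(), key=len, reverse=True)
            chunk.metadata[self.keyword_name] = words[:3]
            return chunk

        # gather() runs the coroutines concurrently and keeps input order.
        return await asyncio.gather(*(tag_one(c) for c in chunks))


async def main():
    chunks = [StubChunk(id="1", text="Climate change impacts ocean temperatures.")]
    return await StubMetatagger().a_tag(chunks)


tagged = asyncio.run(main())
```

With the real class, the equivalent call would presumably be `await metatagger.a_tag(chunks)` inside an async function.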

tag

tag(chunks)

Add metadata to a chunk or a list of chunks.

A metatagger that uses language models to generate keywords and metadata for text chunks.

from datapizza.modules.metatagger import KeywordMetatagger
from datapizza.clients.openai import OpenAIClient

client = OpenAIClient(api_key="your-api-key")
metatagger = KeywordMetatagger(
    client=client,
    max_workers=3,
    system_prompt="Generate relevant keywords for the given text.",
    user_prompt="Extract 5-10 keywords from this text:",
    keyword_name="keywords"
)

# Process chunks (`chunks` is a list of Chunk objects created elsewhere)
tagged_chunks = metatagger.tag(chunks)

Features:

  • Processes chunks in parallel for better performance
  • Configurable prompts for different keyword extraction strategies
  • Adds generated keywords to chunk metadata
  • Supports custom metadata field naming
  • Handles both individual chunks and lists of chunks
  • Uses memory-based conversation for consistent prompting

Input/Output:

  • Input: Chunk objects or lists of Chunk objects
  • Output: Same Chunk objects with additional metadata containing generated keywords
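To make the roles of max_workers and keyword_name more concrete, here is a minimal sketch of parallel tagging with a thread pool. It is not the library's implementation: StubChunk, fake_extract_keywords, and tag_in_parallel are hypothetical names introduced for this example, and the extractor is a simple word filter standing in for the LLM request.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class StubChunk:
    """Hypothetical stand-in for a chunk: an id, some text, a metadata dict."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def fake_extract_keywords(text):
    """Placeholder for the LLM call: lowercase words longer than six letters."""
    cleaned = (word.strip(".,").lower() for word in text.split())
    return [word for word in cleaned if len(word) > 6]


def tag_in_parallel(chunks, max_workers=3, keyword_name="keywords"):
    """Tag each chunk in a worker thread and store the result in its metadata."""

    def tag_one(chunk):
        chunk.metadata[keyword_name] = fake_extract_keywords(chunk.text)
        return chunk

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input chunks.
        return list(pool.map(tag_one, chunks))


chunks = [
    StubChunk("a", "Machine learning algorithms are transforming healthcare diagnostics."),
    StubChunk("b", "Climate change impacts ocean temperatures and marine ecosystems."),
]
tagged = tag_in_parallel(chunks, max_workers=2, keyword_name="topics")
```

In this sketch the same chunk objects come back with one extra metadata field, which matches the Input/Output description above; the field is called "topics" here only because keyword_name was overridden.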

Usage Examples

Basic Keyword Extraction

import os
import uuid

from datapizza.clients.openai import OpenAIClient
from datapizza.modules.metatagger import KeywordMetatagger
from datapizza.type import Chunk

# Initialize client and metatagger (the API key is read from the environment)
client = OpenAIClient(api_key=os.environ["OPENAI_API_KEY"], model="gpt-4o")
metatagger = KeywordMetatagger(
    client=client,
    system_prompt="You are a keyword extraction expert. Generate relevant, concise keywords.",
    user_prompt="Extract 5-8 important keywords from this text:",
    keyword_name="keywords"
)

# Process chunks
chunks = [
    Chunk(id=str(uuid.uuid4()), text="Machine learning algorithms are transforming healthcare diagnostics."),
    Chunk(id=str(uuid.uuid4()), text="Climate change impacts ocean temperatures and marine ecosystems.")
]

tagged_chunks = metatagger.tag(chunks)

# Access generated keywords
for chunk in tagged_chunks:
    print(f"Content: {chunk.text}")
    print(f"Keywords: {chunk.metadata.get('keywords', [])}")
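Once chunks carry keywords in their metadata, one way to use them for discoverability is a keyword-to-chunk lookup. The sketch below is an illustration under assumptions: StubChunk, normalize_keywords, and build_keyword_index are hypothetical helpers, and because the shape of the stored value is not specified here, the code accepts either a list of strings or a single comma-separated string.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StubChunk:
    """Hypothetical stand-in for a chunk: an id, some text, a metadata dict."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def normalize_keywords(value):
    """Return clean lowercase keywords from a list or a comma-separated string."""
    if isinstance(value, str):
        value = value.split(",")
    return [k.strip().lower() for k in (value or []) if k and k.strip()]


def build_keyword_index(chunks, keyword_name="keywords"):
    """Map each keyword to the ids of the chunks tagged with it."""
    index = defaultdict(list)
    for chunk in chunks:
        for keyword in normalize_keywords(chunk.metadata.get(keyword_name)):
            index[keyword].append(chunk.id)
    return dict(index)


chunks = [
    StubChunk("a", "ML in hospitals", {"keywords": ["Machine Learning", "healthcare"]}),
    StubChunk("b", "Heat and health", {"keywords": "healthcare, climate"}),
    StubChunk("c", "Untagged text"),
]
index = build_keyword_index(chunks)
```

Chunks without the keyword field, such as "c" above, are simply skipped, so the index can be built even when tagging was only applied to part of a collection.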