ChunkEmbedder
datapizza.embedders.ChunkEmbedder
Bases: PipelineComponent
ChunkEmbedder is a module that, given a list of chunks, attaches a list of embeddings to each chunk.
__init__
Initialize the ChunkEmbedder.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`client` | `BaseEmbedder` | The client to use for embedding. | *required* |
`model_name` | `str` | The model name to use for embedding. Defaults to `None`. | `None` |
`embedding_name` | `str` | The name of the embedding to use. Defaults to `None`. | `None` |
`batch_size` | `int` | The batch size to use for embedding. Defaults to `2047`. | `2047` |
a_embed (async)
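This page gives no signature for `a_embed`. The sketch below assumes it mirrors `embed()`, taking the same list of chunks and returning them with embeddings attached, and that it can simply be awaited; both are assumptions, not confirmed here.

```python
import asyncio
import os

from datapizza.embedders import ChunkEmbedder
from datapizza.embedders.openai import OpenAIEmbedder
from datapizza.type import Chunk


async def main():
    client = OpenAIEmbedder(api_key=os.getenv("OPENAI_API_KEY"))
    embedder = ChunkEmbedder(client=client, batch_size=50)
    chunks = [Chunk(id="1", text="Some text to embed", metadata={"source": "doc1"})]

    # Assumed to mirror embed(): same input, awaited instead of blocking
    embedded_chunks = await embedder.a_embed(chunks)
    print(len(embedded_chunks[0].embeddings))


asyncio.run(main())
```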
Usage
```python
from datapizza.embedders import ChunkEmbedder
from datapizza.core.clients import Client

# Initialize with any compatible client
client = Client(...)  # Your client instance

embedder = ChunkEmbedder(
    client=client,
    model_name="text-embedding-ada-002",  # Optional model override
    embedding_name="my_embeddings",       # Optional custom embedding name
    batch_size=100,                       # Optional batch size for processing
)

# Embed chunks - adds embeddings to chunk objects
embedded_chunks = embedder.embed(chunks)
```
Features
- Specialized for embedding lists of Chunk objects
- Batch processing with configurable batch size (see the sketch after this list)
- Adds embeddings directly to Chunk objects
- Preserves original chunk structure and metadata
- Async embedding support with `a_embed()`
- Memory efficient batch processing
- Works with any compatible embedding client
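As referenced in the batch-processing bullet above, a minimal sketch of embedding a larger chunk list in smaller batches. The chunk texts and the batch size of 50 are illustrative, and the assumption that the list is split into internal batches of `batch_size` items is inferred from the parameter description, not verified against the implementation.

```python
import os

from datapizza.embedders import ChunkEmbedder
from datapizza.embedders.openai import OpenAIEmbedder
from datapizza.type import Chunk

client = OpenAIEmbedder(api_key=os.getenv("OPENAI_API_KEY"))
embedder = ChunkEmbedder(client=client, batch_size=50)

# 500 small illustrative chunks
chunks = [
    Chunk(id=str(i), text=f"Chunk number {i}", metadata={"source": "demo"})
    for i in range(500)
]

# Assumed behaviour: the 500 chunks are embedded in 10 batches of 50
embedded_chunks = embedder.embed(chunks)
print(len(embedded_chunks))
```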
Examples
Basic Chunk Embedding
```python
import os

from datapizza.embedders import ChunkEmbedder
from datapizza.embedders.openai import OpenAIEmbedder
from datapizza.type import Chunk
from dotenv import load_dotenv

load_dotenv()

# Create client and embedder
client = OpenAIEmbedder(api_key=os.getenv("OPENAI_API_KEY"))
embedder = ChunkEmbedder(
    client=client,
    model_name="text-embedding-ada-002",
    batch_size=50,
)

# Create sample chunks
chunks = [
    Chunk(id="1", text="First chunk of text", metadata={"source": "doc1"}),
    Chunk(id="2", text="Second chunk of text", metadata={"source": "doc2"}),
    Chunk(id="3", text="Third chunk of text", metadata={"source": "doc3"}),
]

# Embed chunks (modifies chunks in-place)
embedded_chunks = embedder.embed(chunks)

# Check embeddings were added
for i, chunk in enumerate(embedded_chunks):
    print(f"Chunk {i+1}:")
    print(f" Text: {chunk.text[:50]}...")
    print(f" Embeddings: {len(chunk.embeddings)}")
    if chunk.embeddings:
        print(f" Embedding name: {chunk.embeddings[0].name}")
        print(f" Vector size: {len(chunk.embeddings[0].vector)}")
```