ChunkEmbedder
datapizza.embedders.ChunkEmbedder
Bases: PipelineComponent
ChunkEmbedder is a module that, given a list of chunks, attaches a list of embeddings to each chunk.
__init__
Initialize the ChunkEmbedder.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`client` | `BaseEmbedder` | The client to use for embedding. | *required* |
`model_name` | `str` | The model name to use for embedding. Defaults to `None`. | `None` |
`embedding_name` | `str` | The name of the embedding to use. Defaults to `None`. | `None` |
`batch_size` | `int` | The batch size to use for embedding. Defaults to `2047`. | `2047` |
a_embed (async)
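This page gives no signature for `a_embed`. The sketch below assumes it mirrors `embed()`, taking the same list of chunks and returning them with embeddings attached, and that it can simply be awaited; both are assumptions, not confirmed here.

```python
import asyncio
import os

from datapizza.embedders import ChunkEmbedder
from datapizza.embedders.openai import OpenAIEmbedder
from datapizza.type import Chunk


async def main():
    client = OpenAIEmbedder(api_key=os.getenv("OPENAI_API_KEY"))
    embedder = ChunkEmbedder(client=client, batch_size=50)
    chunks = [Chunk(id="1", text="Some text to embed", metadata={"source": "doc1"})]

    # Assumed to mirror embed(): same input, awaited instead of blocking
    embedded_chunks = await embedder.a_embed(chunks)
    print(len(embedded_chunks[0].embeddings))


asyncio.run(main())
```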
Usage
```python
from datapizza.embedders import ChunkEmbedder
from datapizza.core.clients import Client

# Initialize with any compatible client
client = Client(...)  # Your client instance

embedder = ChunkEmbedder(
    client=client,
    model_name="text-embedding-ada-002",  # Optional model override
    embedding_name="my_embeddings",       # Optional custom embedding name
    batch_size=100,                       # Optional batch size for processing
)

# Embed chunks - adds embeddings to chunk objects
embedded_chunks = embedder.embed(chunks)
```
Features
- Specialized for embedding lists of Chunk objects
- Batch processing with configurable batch size (see the sketch after this list)
- Adds embeddings directly to Chunk objects
- Preserves original chunk structure and metadata
- Async embedding support with `a_embed()`
- Memory efficient batch processing
- Works with any compatible embedding client
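As referenced in the batch-processing bullet above, a minimal sketch of embedding a larger chunk list in smaller batches. The chunk texts and the batch size of 50 are illustrative, and the assumption that the list is split into internal batches of `batch_size` items is inferred from the parameter description, not verified against the implementation.

```python
import os

from datapizza.embedders import ChunkEmbedder
from datapizza.embedders.openai import OpenAIEmbedder
from datapizza.type import Chunk

client = OpenAIEmbedder(api_key=os.getenv("OPENAI_API_KEY"))
embedder = ChunkEmbedder(client=client, batch_size=50)

# 500 small illustrative chunks
chunks = [
    Chunk(id=str(i), text=f"Chunk number {i}", metadata={"source": "demo"})
    for i in range(500)
]

# Assumed behaviour: the 500 chunks are embedded in 10 batches of 50
embedded_chunks = embedder.embed(chunks)
print(len(embedded_chunks))
```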
Examples
Basic Chunk Embedding
```python
import os

from datapizza.embedders import ChunkEmbedder
from datapizza.embedders.openai import OpenAIEmbedder
from datapizza.type import Chunk
from dotenv import load_dotenv

load_dotenv()

# Create client and embedder
client = OpenAIEmbedder(api_key=os.getenv("OPENAI_API_KEY"))
embedder = ChunkEmbedder(
    client=client,
    model_name="text-embedding-ada-002",
    batch_size=50,
)

# Create sample chunks
chunks = [
    Chunk(id="1", text="First chunk of text", metadata={"source": "doc1"}),
    Chunk(id="2", text="Second chunk of text", metadata={"source": "doc2"}),
    Chunk(id="3", text="Third chunk of text", metadata={"source": "doc3"}),
]

# Embed chunks (modifies chunks in-place)
embedded_chunks = embedder.embed(chunks)

# Check embeddings were added
for i, chunk in enumerate(embedded_chunks):
    print(f"Chunk {i+1}:")
    print(f" Text: {chunk.text[:50]}...")
    print(f" Embeddings: {len(chunk.embeddings)}")
    if chunk.embeddings:
        print(f" Embedding name: {chunk.embeddings[0].name}")
        print(f" Vector size: {len(chunk.embeddings[0].vector)}")
```