Ingestion
datapizza.pipeline.pipeline.IngestionPipeline
A pipeline for ingesting data into a vector store.
__init__
Initialize the ingestion pipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| modules | list[PipelineComponent] | List of pipeline components. Defaults to None. | None |
| vector_store | Vectorstore | Vector store to store the ingested data. Defaults to None. | None |
| collection_name | str | Name of the vector store collection to store the ingested data. Defaults to None. | None |
a_run
async
Run the ingestion pipeline asynchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str \| list[str] | The file path or list of file paths to ingest. | required |
| metadata | dict | Metadata to add to the ingested chunks. Defaults to None. | None |
Returns:
| Type | Description |
|---|---|
| list[Chunk] \| None | If vector_store is not set, returns all accumulated chunks from all files. If vector_store is set, returns None after storing all chunks. |
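
Because `a_run` is a coroutine, it has to be awaited or driven by an event loop. The sketch below illustrates that calling pattern and the documented return contract. It does not use datapizza itself: `StubPipeline` is a hypothetical stand-in, and its dict-based "chunks" and file names are assumptions made for illustration only.

```python
import asyncio


class StubPipeline:
    """Hypothetical stand-in that mimics the documented a_run return contract."""

    def __init__(self, vector_store=None):
        self.vector_store = vector_store

    async def a_run(self, file_path, metadata=None):
        # Accept either a single path or a list of paths
        paths = [file_path] if isinstance(file_path, str) else list(file_path)
        # One fake "chunk" per file; the real pipeline produces Chunk objects
        chunks = [{"source": path, "metadata": dict(metadata or {})} for path in paths]
        if self.vector_store is not None:
            self.vector_store.extend(chunks)  # pretend to persist the chunks
            return None
        return chunks


async def main():
    pipeline = StubPipeline()  # no vector store, so the chunks are returned
    return await pipeline.a_run(["a.pdf", "b.pdf"], metadata={"lang": "en"})


chunks = asyncio.run(main())
print(len(chunks))
```

With a real `IngestionPipeline`, the same `await pipeline.a_run(...)` call would presumably sit inside an existing async application rather than behind `asyncio.run`.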
from_yaml
Load the ingestion pipeline from a YAML configuration file.
The YAML configuration supports the following sections:

- `constants`: Key-value pairs for string substitution using `${VAR_NAME}` syntax
- `elements`: Reusable component definitions that can be referenced in modules
- `ingestion_pipeline`: The main pipeline configuration with clients, modules, vector_store, and collection_name

Example elements section:

```yaml
elements:
  my_embedder:
    type: GoogleEmbedder
    module: datapizza.embedders.google
    params:
      max_char: 2000
```

Elements can be referenced in module params using `${element_name}` syntax:

```yaml
modules:
  - name: embedder
    type: ChunkEmbedder
    module: datapizza.embedders
    params:
      client: "${my_embedder}"
```
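
The `${VAR_NAME}` substitution described above can be pictured with Python's standard `string.Template`, which uses the same placeholder syntax. This is only a sketch of the idea and not datapizza's actual loader: the `resolve_constants` helper and the sample `constants` values are hypothetical.

```python
from string import Template


def resolve_constants(value, constants):
    """Recursively replace ${NAME} placeholders in strings, lists, and dicts.

    Hypothetical helper for illustration; unknown names are left untouched.
    """
    if isinstance(value, str):
        # safe_substitute leaves placeholders without a matching key as-is
        return Template(value).safe_substitute(constants)
    if isinstance(value, list):
        return [resolve_constants(item, constants) for item in value]
    if isinstance(value, dict):
        return {key: resolve_constants(item, constants) for key, item in value.items()}
    return value


# Assumed sample values, not taken from a real configuration
constants = {"COLLECTION": "docs", "MAX_CHAR": "2000"}
config = {
    "collection_name": "${COLLECTION}",
    "modules": [{"name": "embedder", "params": {"client": "${my_embedder}"}}],
}

resolved = resolve_constants(config, constants)
print(resolved["collection_name"])
```

In this sketch the `${my_embedder}` reference survives the constants pass unchanged, so a later step could resolve it against the `elements` section.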
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config_path | str | Path to the YAML configuration file. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| IngestionPipeline | IngestionPipeline | The ingestion pipeline instance. |
run
Run the ingestion pipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str \| list[str] | The file path or list of file paths to ingest. | required |
| metadata | dict | Metadata to add to the ingested chunks. Defaults to None. | None |
Returns:
| Type | Description |
|---|---|
| list[Chunk] \| None | If vector_store is not set, returns all accumulated chunks from all files. If vector_store is set, returns None after storing all chunks. |
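
Since `run` returns either a list of chunks or `None`, callers may want to normalize the result before iterating over it. The helper below is one possible way to do that. `run_and_collect` and both stub classes are hypothetical and exist only to make the example runnable without datapizza installed.

```python
def run_and_collect(pipeline, file_path, metadata=None):
    """Call pipeline.run and always return a list (empty when chunks were stored)."""
    result = pipeline.run(file_path, metadata=metadata)
    return [] if result is None else result


class StoringStub:
    """Hypothetical pipeline with a vector store configured: run() returns None."""

    def run(self, file_path, metadata=None):
        return None


class CollectingStub:
    """Hypothetical pipeline without a vector store: run() returns the chunks."""

    def run(self, file_path, metadata=None):
        return ["chunk-1", "chunk-2"]


print(run_and_collect(StoringStub(), "report.pdf"))
print(len(run_and_collect(CollectingStub(), "report.pdf")))
```

An empty list here means only that nothing was returned to the caller. It does not say whether chunks were written to the vector store.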