Create or Update Collection
The Create or Update Collection component belongs to the Retrieval Augmented Generation (RAG) category. It manages a Vector Database collection by creating a new collection or updating an existing one. You specify a database provider, collection name, and partition details, and configure the embeddings and document structure. The component supports adding, updating, or deleting documents based on defined conditions, so that vector-based data can be stored and retrieved efficiently.
Overview
The Create or Update Collection component creates or updates Vector Database collections. The component manages the storage infrastructure required for semantic search and vector-based retrieval in GenAI applications.
This component serves as the final step in RAG pipeline preparation, taking chunked text data and converting it into a searchable vector database collection with embeddings.
How to use:
Important
A custom function that transforms chunk arrays into the collection data format is mandatory. Place the Call Custom Function component immediately after the Chunk Text component and before Create or Update Collection.
The custom function should implement the following logic:
Access session data using session = message.request.session and pipeline_id = message.get_arg('pipeline_id').
For multiple documents (loop scenario): Get current document URL using loop_item = session.get_session_data(pipeline_id, 'loop_item') and extract document name using document_name = os.path.basename(loop_item).
For single document: Get document name from input parameter using request_inputs = message.request.httpRequest.get("inputs", {}) and document_name = request_inputs.get("input_var_name", "default_value").
Retrieve processed chunks using chunks = session.get_session_data(pipeline_id, 'doc_chunks').
Error handling: Raise ValueError if chunks is None or empty. Otherwise, initialize an empty results list.
For each chunk: Generate a unique UUID using random_id = str(uuid.uuid4()), create a structured entry as chunk_entry = {"id": random_id, "document_name": document_name, "text": chunk}, and append it to the results list.
Return complete results array. This transformation ensures each chunk becomes a properly structured database record with unique identification, source document tracking, and embeddable text content.
COLLECTION DATA SOURCE: The collection_data input must always use the session source and point to the output variable of the prerequisite custom function (typically mapped to a session path such as "transformed_chunks" or "collection_ready_data").
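The transformation steps above can be sketched as follows. This is a minimal sketch rather than the platform's definitive implementation: the entry-point name `transform_chunks` and its single `message` parameter are assumptions, while the session calls and the `loop_item` and `doc_chunks` keys come from the steps above. The record-building logic sits in a separate helper so it can be tested without the pipeline runtime.

```python
import os
import uuid


def build_collection_data(chunks, document_name):
    """Convert a list of text chunks into collection-ready records."""
    # Fail early if the Chunk Text step produced nothing.
    if not chunks:
        raise ValueError("No chunks found for document: " + str(document_name))

    results = []
    for chunk in chunks:
        results.append({
            "id": str(uuid.uuid4()),         # unique ID per chunk
            "document_name": document_name,  # source document tracking
            "text": chunk,                   # content to be embedded
        })
    return results


def transform_chunks(message):
    """Hypothetical entry point for the custom function (loop scenario)."""
    session = message.request.session
    pipeline_id = message.get_arg("pipeline_id")

    # Current document URL, as set for the loop iteration.
    loop_item = session.get_session_data(pipeline_id, "loop_item")
    document_name = os.path.basename(loop_item)

    chunks = session.get_session_data(pipeline_id, "doc_chunks")
    return build_collection_data(chunks, document_name)
```

For the single-document case, only the way `document_name` is obtained changes (from the request inputs instead of `loop_item`); the helper stays the same.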
Key Terms
| Term | Definition |
|---|---|
| Vector Database | A specialized database designed to store and search vector embeddings, which are numerical representations of text, images, or other data. |
| Collection | A container in a vector database that holds a set of related documents and their vector embeddings. |
| Embeddings | Numerical vector representations of data (such as text) that capture semantic meaning, allowing for similarity-based searches. |
| Schema | The structure that defines how data is organized in the collection, including field names, data types, and properties. |
When to Use
You can use the Create or Update Collection component when:
Setting up a new RAG system that requires vector storage.
Storing and retrieving data using semantic similarity rather than exact matching.
Updating existing vector collections with new documents or schema changes.
Building GenAI applications that need to reference external knowledge.
Implementing semantic search functionality in your Pipeline Builder flow.
Component Configuration
Required Inputs
| Input | Description | Data Type | Example |
|---|---|---|---|
| VectorDB Provider | The vector database provider that stores your collection. Milvus is currently supported as the default option. | VectorDBProvider | |
| Collection ID | A unique identifier for your collection. It is used to reference the collection in other components. | String | |
| ID Column Name | The name of the column that serves as the primary identifier for documents in the collection. | String | |
| Embed Column Name | The name of the column containing the text that is converted to vector embeddings for semantic search. | String | |
| Document Column Names | An array of column names to include as part of each document in the collection. | Array | |
| Document Schema | JSON that describes the schema of the document, including field names, data types, and properties. | JSON | |
Optional Inputs
| Input | Description | Data Type | Example |
|---|---|---|---|
| Embeddings Provider | The service that generates vector embeddings from your text. If not provided, a default provider can be used. | EmbeddingsProvider | |
| Partition ID | An optional identifier for a partition within the collection. Partitions are logical divisions of your data that group related documents together. When querying, you can target specific partitions rather than searching the entire collection, which can significantly improve performance on large datasets. | String | |
| Filter By Column Names | An array of column names that can be used for filtering when querying the collection. | Array | |
| Collection Data | An array of documents to be added to the collection. Each document should conform to the defined schema. | Array | |
| Documents to Update | A list of document IDs that should be updated. If not provided, all documents in the collection data are updated. | Array | |
| Documents to Delete | A list of document IDs that should be deleted from the collection. | Array | |
| Documents to Delete Where | A condition that selects the documents to delete, specified as a JSON object. | JSON | |
How It Works
The component first checks if the specified collection exists in the vector database.
If the collection does not exist, it creates a new collection based on the provided schema.
If the collection already exists, it validates that the incoming data matches the existing schema.
For document updates, the component processes the documents:
The specified embed column content is sent to the embedding provider to generate vector embeddings.
These embeddings, along with the original document data, are stored in the vector database.
If documents are specified for deletion, they are removed from the collection.
The component returns a response indicating the success or failure of the operation, including details about any documents that were added, updated, or deleted.
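To make the sequence above concrete, here is a toy in-memory walk-through. It is a simplified sketch under stated assumptions, not the component's actual code: the `store` dictionary stands in for the vector database, `fake_embed` stands in for an embedding provider, the schema check is reduced to comparing field names, and the shape of the returned summary is hypothetical.

```python
def create_or_update(store, collection_id, schema_fields, documents,
                     id_column, embed_column, embed_fn, delete_ids=()):
    """Toy in-memory walk-through of the create-or-update flow."""
    created = collection_id not in store
    if created:
        # Collection does not exist: create it from the provided schema.
        store[collection_id] = {"fields": set(schema_fields), "rows": {}}
    collection = store[collection_id]

    # Validate incoming documents against the collection's schema
    # (simplified here to a field-name comparison).
    for doc in documents:
        unknown = set(doc) - collection["fields"]
        if unknown:
            raise ValueError(f"Fields not in schema: {sorted(unknown)}")

    added, updated = [], []
    for doc in documents:
        doc_id = doc[id_column]
        (updated if doc_id in collection["rows"] else added).append(doc_id)
        # Embed the embed column and store the vector with the document.
        collection["rows"][doc_id] = {**doc, "vector": embed_fn(doc[embed_column])}

    # Remove documents requested for deletion; unknown IDs are ignored.
    deleted = [d for d in delete_ids
               if collection["rows"].pop(d, None) is not None]
    return {"created": created, "added": added,
            "updated": updated, "deleted": deleted}


def fake_embed(text):
    """Stand-in for a real embedding provider; returns a tiny 2-value vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]
```

Calling the function twice with the same `collection_id` shows both branches: the first call creates the collection and reports every document as added, and a later call reports documents with existing IDs as updated.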
Typical RAG Pipeline Patterns
Single Document Flow: Download From S3 → Extract Text → Chunk Text → Call Custom Function (transformation) → Create or Update Collection.
Multiple Documents Flow: Loop → Download From S3 → Extract Text → Chunk Text → Call Custom Function (transformation) → Create or Update Collection (for all chunks of the current document). The component automatically creates the collection if it is new, or updates the existing collection with the additional data.
Custom Schema Adaptation: When users specify custom field requirements, modify the document_schema, document_column_names, id_column_name, and embed_column_name inputs accordingly, but maintain the same transformation logic pattern in the prerequisite custom function. The component integrates with Milvus vector database for persistent storage and SentenceTransformer for embedding generation, creating a searchable knowledge base infrastructure ready for RAG query operations.
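As an illustration of custom schema adaptation, the sketch below keeps the same transformation pattern but makes the field names configurable, and derives the four inputs from the same names so they cannot drift apart. The field names used in the usage note (`chunk_id`, `source`, `content`) and the `max_length` values are hypothetical; the schema layout follows the Example Schema in the Document Schema Reference.

```python
import uuid


def build_records(chunks, document_name, id_field="id",
                  source_field="document_name", text_field="text"):
    """Same transformation pattern, with configurable field names."""
    if not chunks:
        raise ValueError("chunks must be a non-empty list")
    return [
        {id_field: str(uuid.uuid4()),
         source_field: document_name,
         text_field: chunk}
        for chunk in chunks
    ]


def matching_inputs(id_field, source_field, text_field, max_text_length=2000):
    """Build the four inputs so they agree with the records' field names."""
    return {
        "id_column_name": id_field,
        "embed_column_name": text_field,
        "document_column_names": [id_field, source_field, text_field],
        "document_schema": {
            "auto_id": False,
            "enable_dynamic_field": True,
            "fields": [
                {"field_name": id_field, "datatype": "DataType.VARCHAR",
                 "max_length": 100, "is_primary": True, "auto_id": False},
                {"field_name": source_field, "datatype": "DataType.VARCHAR",
                 "max_length": 500},
                {"field_name": text_field, "datatype": "DataType.VARCHAR",
                 "max_length": max_text_length},
            ],
        },
    }
```

For example, `build_records(chunks, "faq.pdf", "chunk_id", "source", "content")` together with `matching_inputs("chunk_id", "source", "content")` yields records and inputs that use the same three column names.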
Example Use Case: Multilingual Knowledge Base
Let's say you're building a multilingual knowledge base for your application documentation:
Scenario
You need to create a vector database collection that stores documentation in multiple languages and allows users to search for relevant information regardless of language.
Configuration
VectorDB Provider: Milvus - default
Collection ID: multilingual_docs
Embeddings Provider: SentenceTransformer_all-MiniLM-L6-v2_384
ID Column Name: id
Embed Column Name: description
Document Column Names: ["id", "name", "description", "definition", "language"]
Document Schema: A JSON object defining the schema with fields for id, name, description, definition, and language
Filter By Column Names: ["language", "name"]
Collection Data: An array of documentation entries in various languages
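The configuration above can be written out as plain data and checked for consistency before the pipeline runs. This is only a sketch: the keys `collection_id` and `filter_by_column_names` are assumed snake_case forms of the input names, the two sample records are invented for illustration, and the rule that filter columns must also be document columns is an assumption. Column names are written in lower case so that they match the entries in Document Column Names.

```python
config = {
    "collection_id": "multilingual_docs",
    "id_column_name": "id",
    "embed_column_name": "description",
    "document_column_names": ["id", "name", "description",
                              "definition", "language"],
    "filter_by_column_names": ["language", "name"],
    # Invented sample records, one per language.
    "collection_data": [
        {"id": "doc-en-1", "name": "Login",
         "description": "How to sign in to the application",
         "definition": "Authentication entry point", "language": "en"},
        {"id": "doc-es-1", "name": "Login",
         "description": "Cómo iniciar sesión en la aplicación",
         "definition": "Punto de entrada de autenticación", "language": "es"},
    ],
}


def check_config(cfg):
    """Return a list of consistency problems; an empty list means none found."""
    columns = set(cfg["document_column_names"])
    problems = []
    for key in ("id_column_name", "embed_column_name"):
        if cfg[key] not in columns:
            problems.append(f"{key} '{cfg[key]}' is not a document column")
    for name in cfg["filter_by_column_names"]:
        if name not in columns:
            problems.append(f"filter column '{name}' is not a document column")
    for index, doc in enumerate(cfg["collection_data"]):
        missing = columns - set(doc)
        if missing:
            problems.append(f"record {index} is missing {sorted(missing)}")
    return problems
```

A check like this catches case mismatches early, for example an embed column given as `Description` when the documents use `description`.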
Process
The component creates a new vector collection called "multilingual_docs" in Milvus.
Each document's description field is converted to vector embeddings using the specified embeddings provider.
The documents, along with their embeddings, are stored in the collection.
The "language" and "name" fields are set up as filterable fields, allowing users to filter search results by language or document name.
Result
A vector database collection is created that enables semantic search across multilingual documentation. Users can now search for concepts and receive relevant results regardless of the language, and filter results by specific languages if needed.
Best Practices
Choose the right embed column: Select a column that contains rich, descriptive text that captures the meaning of each document for optimal semantic search performance.
Design your schema carefully: Plan your document structure with consideration for how you will query and filter the data later.
Use meaningful document IDs: Choose IDs that have semantic meaning when possible to make debugging and data management easier.
Consider partitioning for large collections: Use partition IDs to organize large datasets, which can improve query performance.
Include relevant filter columns: Define the columns that are commonly used for filtering to optimize search efficiency.
Batch updates for efficiency: When updating multiple documents, do so in batches rather than individually for better performance.
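For the batching recommendation, one simple approach is to split the prepared records into fixed-size groups before handing them to the component. The helper below is a generic sketch; the name `batched` and any particular batch size are assumptions, and a suitable size depends on your documents and provider limits.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each yielded slice can then be used as the Collection Data for one run of the component, instead of sending documents one at a time.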
Troubleshooting
| Issue | Possible Cause | Solution |
|---|---|---|
| Collection creation fails | Invalid schema definition or connection issues with the vector database | Check the schema JSON for syntax errors and verify that the vector database provider is properly configured and accessible. |
| Documents not being updated | Document IDs do not match existing records or the "Documents to Update" list is incomplete | Verify that the document IDs match exactly with existing records in the collection and ensure the update list includes all relevant documents. |
| Embedding generation fails | Issues with the embedding provider or invalid text in the embed column | Check that the embedding provider is properly configured and that the text in the embed column is valid and within any length limitations. |
| Schema mismatch errors | Attempting to update a collection with data that does not match the existing schema | Ensure that new or updated documents conform to the existing collection schema, especially regarding data types and required fields. |
Document Schema Reference
The Document Schema is a critical part of configuring your vector collection. It defines the structure, data types, and properties of the fields in your collection. Here is a detailed explanation of the schema elements:
Schema Structure
| Property | Description | Example |
|---|---|---|
| | When set to | |
| | When set to | |
| | An array of field definitions that make up your document structure. | See field definitions below |
Field Definitions
Each field in the schema has the following properties:
| Property | Description | Example |
|---|---|---|
| | The name of the field as it appears in your documents. | |
| | The data type of the field. See the Data Types table below for available options. | |
| | For VARCHAR types, specifies the maximum length of text that can be stored in this field. | |
| | When set to | |
| | For primary key fields, when set to | |
Data Types
The vector database supports various data types for fields:
| Data Type | Description | Usage |
|---|---|---|
| | Variable-length character string. Requires a | Text fields like names, descriptions, IDs, etc. |
| | 64-bit integer values. | Numeric IDs, counts, or other integer values. |
| | Single-precision floating point numbers. | Decimal values, scores, or metrics. |
| | Double-precision floating point numbers. | High-precision decimal values or scientific calculations. |
| | Boolean values (true/false). | Flags, toggles, or binary states. |
| | JSON-formatted data that can contain complex nested structures. | Storing structured data, configurations, or hierarchical information. |
| | An array of values of a specific type. | Lists of tags, categories, or related items. |
Example Schema
{"auto_id":false,"enable_dynamic_field":true,"fields":[{"field_name":"id","datatype":"DataType.VARCHAR","max_length":100,"is_primary":true,"auto_id":false},{"field_name":"name","datatype":"DataType.VARCHAR","max_length":100},{"field_name":"description","datatype":"DataType.VARCHAR","max_length":2000},{"field_name":"category","datatype":"DataType.VARCHAR","max_length":50},{"field_name":"views","datatype":"DataType.INT64"},{"field_name":"active","datatype":"DataType.BOOL"},{"field_name":"metadata","datatype":"DataType.JSON"}]}
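Because schema errors are a common cause of collection creation failures, it can help to check a schema before using it. The sketch below is a hypothetical pre-flight check, not something the component provides: it parses a shortened version of the example schema and applies a few rules inferred from the reference above (unique field names, exactly one primary field, and a `max_length` on every VARCHAR field).

```python
import json


def validate_schema(schema):
    """Return a list of problems found in a document schema dict."""
    problems = []
    fields = schema.get("fields", [])
    if not fields:
        problems.append("schema has no fields")

    names = [f.get("field_name") for f in fields]
    if len(names) != len(set(names)):
        problems.append("duplicate field names")

    primary = [f.get("field_name") for f in fields if f.get("is_primary")]
    if len(primary) != 1:
        problems.append(
            f"expected exactly one primary field, found {len(primary)}")

    for f in fields:
        if f.get("datatype") == "DataType.VARCHAR" and "max_length" not in f:
            problems.append(
                f"VARCHAR field '{f.get('field_name')}' has no max_length")
    return problems


# Shortened version of the example schema, parsed from JSON text.
example = json.loads(
    '{"auto_id": false, "enable_dynamic_field": true, "fields": ['
    '{"field_name": "id", "datatype": "DataType.VARCHAR", "max_length": 100,'
    ' "is_primary": true, "auto_id": false},'
    '{"field_name": "views", "datatype": "DataType.INT64"}]}'
)
print(validate_schema(example))  # prints []
```

Parsing the schema with `json.loads` also surfaces plain JSON syntax errors before the component runs.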
Limitations and Considerations
Performance with large collections: As your collection grows, search and update operations may become slower. Consider using partitions for better management of large datasets.
Text length limitations: Be aware of any maximum length constraints for the fields in your schema, especially for text that is embedded.
Embedding model selection: The choice of embedding model can significantly impact the quality of semantic search results. Consider domain-specific models for specialized content.
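To act on the text length consideration, records can be compared against the schema's `max_length` values before they are sent to the component. The helper below is a hypothetical sketch. Whether the limit is measured in characters or bytes is not stated here, so it uses the UTF-8 byte length as the more conservative assumption.

```python
def find_oversized(records, schema):
    """Return (record_index, field_name, length) for values over max_length."""
    # Collect the length limit of every field that declares one.
    limits = {
        f["field_name"]: f["max_length"]
        for f in schema["fields"]
        if "max_length" in f
    }
    oversized = []
    for index, record in enumerate(records):
        for name, limit in limits.items():
            value = record.get(name)
            if isinstance(value, str):
                size = len(value.encode("utf-8"))  # byte length, see note above
                if size > limit:
                    oversized.append((index, name, size))
    return oversized
```

Records flagged by this check can be re-chunked with a smaller chunk size, or the schema's `max_length` can be raised, before the collection is created or updated.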
