Create or Update Collection

The Create or Update Collection component belongs to the Retrieval Augmented Generation (RAG) category. It manages a Vector Database collection by creating a new collection or updating an existing one. You specify a database provider, a collection name, and partition details, and configure the embeddings and document structure. The component supports adding, updating, or deleting documents based on defined conditions, so that vector-based data stays consistent for storage and retrieval.

Overview

The Create or Update Collection component creates or updates Vector Database collections. The component manages the storage infrastructure required for semantic search and vector-based retrieval in GenAI applications.

This component serves as the final step in RAG pipeline preparation, taking chunked text data and converting it into a searchable vector database collection with embeddings.

How to use:

Important

A custom function that transforms the chunk array into the collection data format is mandatory. Place the Call Custom Function component immediately after Chunk Text and before Create or Update Collection.

The custom function should implement the following logic:

  • Access session data using session = message.request.session and pipeline_id = message.get_arg('pipeline_id').

  • For multiple documents (loop scenario): Get current document URL using loop_item = session.get_session_data(pipeline_id, 'loop_item') and extract document name using document_name = os.path.basename(loop_item).

  • For single document: Get document name from input parameter using request_inputs = message.request.httpRequest.get("inputs", {}) and document_name = request_inputs.get("input_var_name", "default_value").

  • Retrieve processed chunks using chunks = session.get_session_data(pipeline_id, 'doc_chunks').

  • Error handling: Raise ValueError if chunks is None or empty. Otherwise, initialize an empty results list.

  • Process each chunk in turn: Generate a unique ID using random_id = str(uuid.uuid4()), create a structured entry as chunk_entry = {"id": random_id, "document_name": document_name, "text": chunk}, and append it to the results list.

  • Return the complete results array. This transformation ensures that each chunk becomes a structured database record with a unique ID, the source document name, and embeddable text content.
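The logic above can be sketched as a single Python function. This is a minimal sketch, not the platform's reference implementation: the function name transform_chunks and the use_loop_item flag are hypothetical, and the entry-point signature that Call Custom Function expects is not specified here. The message, session, and session key names are the ones listed above.

```python
import os
import uuid


def transform_chunks(message, use_loop_item=True):
    """Turn the chunk list stored in the session into collection records.

    `transform_chunks` and `use_loop_item` are hypothetical names used for
    illustration; adapt them to the entry point your custom function needs.
    """
    session = message.request.session
    pipeline_id = message.get_arg('pipeline_id')

    if use_loop_item:
        # Multiple documents (loop scenario): the loop item holds the document URL.
        loop_item = session.get_session_data(pipeline_id, 'loop_item')
        document_name = os.path.basename(loop_item)
    else:
        # Single document: read the document name from a request input.
        request_inputs = message.request.httpRequest.get("inputs", {})
        document_name = request_inputs.get("input_var_name", "default_value")

    chunks = session.get_session_data(pipeline_id, 'doc_chunks')
    if not chunks:
        # Covers both None and an empty list.
        raise ValueError("No chunks found under session key 'doc_chunks'")

    results = []
    for chunk in chunks:
        results.append({
            "id": str(uuid.uuid4()),          # unique ID per chunk
            "document_name": document_name,   # source document tracking
            "text": chunk,                    # text that will be embedded
        })
    return results
```

The returned list is what the output variable of the custom function should hold, so that Create or Update Collection can read it from the session.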

COLLECTION DATA SOURCE: The collection_data input must always use the session source and point to the output variable of the prerequisite custom function (typically mapped to a session path such as "transformed_chunks" or "collection_ready_data").

Key Terms

Term

Definition

Vector Database

A specialized database designed to store and search vector embeddings, which are numerical representations of text, images, or other data.

Collection

A container in a vector database that holds a set of related documents and their vector embeddings.

Embeddings

Numerical vector representations of data (like text) that capture semantic meaning, allowing for similarity-based searches.

Schema

The structure that defines how data is organized in the collection, including field names, data types, and properties.

When to Use

Use the Create or Update Collection component when you are:

  • Setting up a new RAG system that requires vector storage.

  • Storing and retrieving data by semantic similarity rather than exact matching.

  • Updating existing vector collections with new documents or schema changes.

  • Building GenAI applications that need to reference external knowledge.

  • Implementing semantic search functionality in your Pipeline Builder flow.

Component Configuration

Required Inputs

Input

Description

Data Type

Example

VectorDB Provider

The vector database provider that stores your collection. Milvus is currently supported as the default option.

VectorDBProvider

Milvus - default

Collection ID

A unique identifier for your collection. This is used to reference the collection in other components.

String

sample_collection

ID Column Name

The name of the column that serves as the primary identifier for documents in the collection.

String

id

Embed Column Name

The name of the column containing text that is converted to vector embeddings for semantic search.

String

description

Document Column Names

An array of column names to include as part of each document in the collection.

Array

["id", "name", "description", "definition"]

Document Schema

JSON that describes the schema of the document, including field names, data types, and properties.

JSON

{
  "auto_id": false,
  "enable_dynamic_field": true,
  "fields": [
    {
      "field_name": "id",
      "datatype": "DataType.VARCHAR",
      "max_length": 100,
      "is_primary": true
    },
    {
      "field_name": "description",
      "datatype": "DataType.VARCHAR",
      "max_length": 2000
    }
  ]
}

Optional Inputs

Input

Description

Data Type

Example

Embeddings Provider

The service that generates vector embeddings from your text. If not provided, a default provider can be used.

EmbeddingsProvider

SentenceTransformer_all-MiniLM-L6-v2_384

Partition ID

An optional identifier for a partition within the collection. Partitions are logical divisions of your data that allow for more efficient querying and management. They enable you to organize related documents together and can significantly improve performance when working with large datasets. When querying, you can target specific partitions rather than searching the entire collection.

String

english_docs or technical_manuals

Filter By Column Names

An array of column names that can be used for filtering when querying the collection.

Array

["id", "name", "description"]

Collection Data

An array of documents to be added to the collection. Each document should conform to the defined schema.

Array

[
  {
    "id": "japanese_eng_text",
    "name": "japanese english text",
    "description": "The \"Create App Database\" Wizard...",
    "definition": {
      "id": "B9JYV1ysEJ12",
      "name": "User"
    }
  }
]

Documents to Update

List of document IDs that should be updated. If not provided, all documents in the collection data are updated.

Array

["doc1", "doc2"]

Documents to Delete

List of document IDs that should be deleted from the collection.

Array

["old_doc1", "old_doc2"]

Documents to Delete Where

A condition based on which documents should be deleted, specified as a JSON object.

JSON

{"Library Id": "abc"}
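Before running the component, it can help to check that the inputs agree with each other. The sketch below is one possible sanity check, not something the component is documented to perform. It assumes the inputs are collected in a plain dictionary, that the ID and embed columns should appear in the document column names, and that the ID column should be the primary field in the schema. The key filter_by_column_names and the function name check_inputs are hypothetical.

```python
def check_inputs(config):
    """Return a list of human-readable problems found in the inputs."""
    problems = []
    # Index schema fields by name for quick lookup.
    schema_fields = {f["field_name"]: f for f in config["document_schema"]["fields"]}
    columns = config["document_column_names"]
    id_col = config["id_column_name"]
    embed_col = config["embed_column_name"]

    # Assumption: both key columns must be listed as document columns.
    if id_col not in columns:
        problems.append(f"id column '{id_col}' is not in document_column_names")
    if embed_col not in columns:
        problems.append(f"embed column '{embed_col}' is not in document_column_names")

    # Assumption: the ID column is the schema's primary field.
    if not schema_fields.get(id_col, {}).get("is_primary", False):
        problems.append(f"id column '{id_col}' is not the primary field in the schema")

    # Assumption: filter columns should also be document columns.
    for col in config.get("filter_by_column_names", []):
        if col not in columns:
            problems.append(f"filter column '{col}' is not in document_column_names")
    return problems


config = {
    "id_column_name": "id",
    "embed_column_name": "description",
    "document_column_names": ["id", "name", "description", "definition"],
    "filter_by_column_names": ["id", "name", "description"],
    "document_schema": {
        "auto_id": False,
        "enable_dynamic_field": True,
        "fields": [
            {"field_name": "id", "datatype": "DataType.VARCHAR",
             "max_length": 100, "is_primary": True},
            {"field_name": "description", "datatype": "DataType.VARCHAR",
             "max_length": 2000},
        ],
    },
}

for problem in check_inputs(config):
    print(problem)
```

Column names are compared exactly, so a value such as "ID" would be reported as a mismatch against a column named "id".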

How It Works

  1. The component first checks if the specified collection exists in the vector database.

  2. If the collection does not exist, it creates a new collection based on the provided schema.

  3. If the collection already exists, it validates that the incoming data matches the existing schema.

  4. For document updates, the component processes the documents:

    • The specified embed column content is sent to the embedding provider to generate vector embeddings.

    • These embeddings, along with the original document data, are stored in the vector database.

  5. If documents are specified for deletion, they are removed from the collection.

  6. The component returns a response indicating the success or failure of the operation, including details about any documents that were added, updated, or deleted.
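The steps above can be illustrated with a small in-memory model. This is a sketch of the flow only, not the component's implementation: the dictionary-based store stands in for the vector database, toy_embed stands in for a real embeddings provider, the response shape is an assumption, and step 3 (schema validation) is left out for brevity. All names in the block are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in for a vector database: collection id -> {doc id -> record}.
Store = Dict[str, Dict[str, dict]]


def toy_embed(text: str) -> List[float]:
    """Placeholder embedding: NOT a real model, just deterministic numbers."""
    return [float(len(text)), float(sum(map(ord, text)) % 997)]


def create_or_update(
    store: Store,
    collection_id: str,
    id_column: str,
    embed_column: str,
    collection_data: Iterable[dict] = (),
    documents_to_delete: Iterable[str] = (),
    embed: Callable[[str], List[float]] = toy_embed,
) -> dict:
    created = collection_id not in store               # step 1: existence check
    collection = store.setdefault(collection_id, {})   # step 2: create if missing

    upserted = []
    for doc in collection_data:                        # step 4: embed and store
        record = dict(doc)
        record["vector"] = embed(doc[embed_column])
        collection[doc[id_column]] = record
        upserted.append(doc[id_column])

    deleted = []
    for doc_id in documents_to_delete:                 # step 5: deletions
        if collection.pop(doc_id, None) is not None:
            deleted.append(doc_id)

    # step 6: report what happened (assumed response shape)
    return {"created": created, "upserted": upserted, "deleted": deleted}
```

Calling the function twice with the same collection ID shows the two branches: the first call reports created as True, and later calls add to or remove from the existing collection.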

Typical RAG Pipeline Patterns

  • Single Document Flow: Download From S3 → Extract Text → Chunk Text → Call Custom Function (transformation) → Create or Update Collection.

  • Multiple Documents Flow: Loop → Download From S3 → Extract Text → Chunk Text → Call Custom Function (transformation) → Create or Update Collection (for all chunks of that document). The component creates the collection if it does not exist yet, or adds the new data to the existing collection.

  • Custom Schema Adaptation: When users specify custom field requirements, modify the document_schema, document_column_names, id_column_name, and embed_column_name inputs accordingly, but maintain the same transformation logic pattern in the prerequisite custom function. The component integrates with Milvus vector database for persistent storage and SentenceTransformer for embedding generation, creating a searchable knowledge base infrastructure ready for RAG query operations.

Example Use Case: Multilingual Knowledge Base

Let's say you're building a multilingual knowledge base for your application documentation:

Scenario

You need to create a vector database collection that stores documentation in multiple languages and allows users to search for relevant information regardless of language.

Configuration

  • VectorDB Provider: Milvus - default

  • Collection ID: multilingual_docs

  • Embeddings Provider: SentenceTransformer_all-MiniLM-L6-v2_384

  • ID Column Name: id

  • Embed Column Name: description

  • Document Column Names: ["id", "name", "description", "definition", "language"]

  • Document Schema: A JSON object defining the schema with fields for id, name, description, definition, and language

  • Filter By Column Names: ["language", "name"]

  • Collection Data: An array of documentation entries in various languages

Process

  1. The component creates a new vector collection called "multilingual_docs" in Milvus.

  2. Each document's description field is converted to vector embeddings using the specified embeddings provider.

  3. The documents, along with their embeddings, are stored in the collection.

  4. The "language" and "name" fields are set up as filterable fields, allowing users to filter search results by language or document name.

Result

A vector database collection is created that enables semantic search across the multilingual documentation. Users can search for concepts and filter results by specific languages if needed. Whether results are relevant regardless of the query language depends on the embeddings provider, so choose a model that supports the languages in your documentation.

Best Practices

  • Choose the right embed column: Select a column that contains rich, descriptive text that captures the meaning of each document for optimal semantic search performance.

  • Design your schema carefully: Plan your document structure around how you will query and filter the data later.

  • Use meaningful document IDs: Choose IDs that have semantic meaning when possible to make debugging and data management easier.

  • Consider partitioning for large collections: Use partition IDs to organize large datasets, which can improve query performance.

  • Include relevant filter columns: Define the columns that are commonly used for filtering to optimize search efficiency.

  • Batch updates for efficiency: When updating multiple documents, do so in batches rather than individually for better performance.
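As an illustration of the batching advice, the helper below splits a list of documents into fixed-size groups. It is a minimal sketch: the function name batched and the batch size are hypothetical, and the right batch size for your deployment is not specified in this document.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of `items` with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


documents = [{"id": str(i), "text": f"chunk {i}"} for i in range(5)]
for batch in batched(documents, batch_size=2):
    # Each batch would be supplied as the Collection Data of one run.
    print([doc["id"] for doc in batch])
```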

Troubleshooting

Issue

Possible Cause

Solution

Collection creation fails

Invalid schema definition or connection issues with the vector database

Check the schema JSON for syntax errors and verify that the vector database provider is properly configured and accessible.

Documents not being updated

Document IDs do not match existing records or the "Documents to Update" list is incomplete

Verify that the document IDs match exactly with existing records in the collection and ensure the update list includes all relevant documents.

Embedding generation fails

Issues with the embedding provider or invalid text in the embed column

Check that the embedding provider is properly configured and that the text in the embed column is valid and within any length limitations.

Schema mismatch errors

Attempting to update a collection with data that does not match the existing schema

Ensure that new or updated documents conform to the existing collection schema, especially regarding data types and required fields.
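To narrow down a schema mismatch before sending data to the component, documents can be checked locally against the schema. The sketch below is an assumption-laden illustration: the mapping from datatype strings to Python types and the function name find_mismatches are hypothetical, and the length check counts characters, whereas the database may measure max_length differently (for example in bytes).

```python
# Assumed mapping from schema datatype strings to acceptable Python types.
PYTHON_TYPES = {
    "DataType.VARCHAR": str,
    "DataType.INT64": int,
    "DataType.FLOAT": (int, float),
    "DataType.DOUBLE": (int, float),
    "DataType.BOOL": bool,
    "DataType.JSON": (dict, list),
    "DataType.ARRAY": list,
}


def find_mismatches(schema, documents):
    """Return a list of messages describing documents that do not fit the schema."""
    problems = []
    for index, doc in enumerate(documents):
        for field in schema["fields"]:
            name = field["field_name"]
            if name not in doc:
                # A missing primary key is only a problem when IDs are not auto-generated.
                if field.get("is_primary") and not schema.get("auto_id"):
                    problems.append(f"document {index}: missing primary field '{name}'")
                continue
            value = doc[name]
            expected = PYTHON_TYPES.get(field["datatype"])
            # bool is a subclass of int in Python, so reject it for non-BOOL fields.
            wrong_type = expected is not None and (
                not isinstance(value, expected)
                or (isinstance(value, bool) and expected is not bool)
            )
            if wrong_type:
                problems.append(f"document {index}: field '{name}' has the wrong type")
            elif isinstance(value, str) and len(value) > field.get("max_length", len(value)):
                problems.append(f"document {index}: field '{name}' exceeds max_length")
    return problems
```

An empty result means no mismatch was found by these checks. It does not guarantee that the database will accept the data.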

Document Schema Reference

The Document Schema is a critical part of configuring your vector collection. It defines the structure, data types, and properties of the fields in your collection. Here is a detailed explanation of the schema elements:

Schema Structure

Property

Description

Example

auto_id

When set to true, the system automatically generates a unique ID for each document. When false, you must provide your own ID for each document.

false

enable_dynamic_field

When set to true, allows documents to contain fields not explicitly defined in the schema. This provides flexibility but may impact performance.

true

fields

An array of field definitions that make up your document structure.

See field definitions below

Field Definitions

Each field in the schema has the following properties:

Property

Description

Example

field_name

The name of the field as it appears in your documents.

"id"

datatype

The data type of the field. See the Data Types table below for available options.

"DataType.VARCHAR"

max_length

For VARCHAR types, specifies the maximum length of text that can be stored in this field.

100

is_primary

When set to true, designates this field as the primary key for the collection. Only one field can be primary.

true

auto_id

For primary key fields, when set to true, the system auto-generates values for this field.

false

Data Types

The vector database supports various data types for fields:

Data Type

Description

Usage

DataType.VARCHAR

Variable-length character string. Requires a max_length specification.

Text fields like names, descriptions, IDs, etc.

DataType.INT64

64-bit integer values.

Numeric IDs, counts, or other integer values.

DataType.FLOAT

Single-precision floating point numbers.

Decimal values, scores, or metrics.

DataType.DOUBLE

Double-precision floating point numbers.

High-precision decimal values or scientific calculations.

DataType.BOOL

Boolean values (true/false).

Flags, toggles, or binary states.

DataType.JSON

JSON-formatted data that can contain complex nested structures.

Storing structured data, configurations, or hierarchical information.

DataType.ARRAY

An array of values of a specific type.

Lists of tags, categories, or related items.

Example Schema

{
  "auto_id": false,
  "enable_dynamic_field": true,
  "fields": [
    {"field_name": "id", "datatype": "DataType.VARCHAR", "max_length": 100, "is_primary": true, "auto_id": false},
    {"field_name": "name", "datatype": "DataType.VARCHAR", "max_length": 100},
    {"field_name": "description", "datatype": "DataType.VARCHAR", "max_length": 2000},
    {"field_name": "category", "datatype": "DataType.VARCHAR", "max_length": 50},
    {"field_name": "views", "datatype": "DataType.INT64"},
    {"field_name": "active", "datatype": "DataType.BOOL"},
    {"field_name": "metadata", "datatype": "DataType.JSON"}
  ]
}
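A schema like this one can be checked against the rules in this reference before it is used. The sketch below covers three of them: only one field can be primary, every datatype must come from the Data Types table, and VARCHAR fields need a max_length. Treating a schema with no primary field as an error is an assumption, and the function name validate_schema is hypothetical.

```python
ALLOWED_DATATYPES = {
    "DataType.VARCHAR", "DataType.INT64", "DataType.FLOAT", "DataType.DOUBLE",
    "DataType.BOOL", "DataType.JSON", "DataType.ARRAY",
}


def validate_schema(schema):
    """Return a list of error messages; an empty list means no rule was violated."""
    errors = []
    fields = schema.get("fields", [])
    if not fields:
        errors.append("schema has no fields")

    # Only one field can be primary (assumption: one is also required).
    primary = [f.get("field_name") for f in fields if f.get("is_primary")]
    if len(primary) != 1:
        errors.append(f"expected exactly one primary field, found {len(primary)}")

    for field in fields:
        name = field.get("field_name")
        datatype = field.get("datatype")
        if datatype not in ALLOWED_DATATYPES:
            errors.append(f"field '{name}' has unknown datatype {datatype!r}")
        # VARCHAR requires a max_length specification.
        if datatype == "DataType.VARCHAR" and not isinstance(field.get("max_length"), int):
            errors.append(f"field '{name}' is VARCHAR but has no integer max_length")
    return errors


broken_schema = {
    "auto_id": False,
    "fields": [
        {"field_name": "id", "datatype": "DataType.VARCHAR", "is_primary": True},
        {"field_name": "views", "datatype": "DataType.INT32"},
    ],
}
for error in validate_schema(broken_schema):
    print(error)
```

The broken_schema example is deliberately invalid: its id field lacks max_length, and DataType.INT32 is not in the Data Types table of this reference.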

Limitations and Considerations

  • Performance with large collections: As your collection grows, search and update operations may become slower. Consider using partitions for better management of large datasets.

  • Text length limitations: Be aware of any maximum length constraints for the fields in your schema, especially for text that is embedded.

  • Embedding model selection: The choice of embedding model can significantly impact the quality of semantic search results. Consider domain-specific models for specialized content.