Qdrant Integration: Lessons Learned

Introduction

This document summarizes our experience integrating the Qdrant vector database with FastEmbed for embedding generation. We ran into problems with vector naming conventions, search query formats, and the search parameters used for deduplication. The sections below describe each issue and the solution we implemented to build a robust vector search system.

Problem Statement

We were experiencing issues with vector name mismatches in our Qdrant integration. Specifically:

  1. Points were being skipped during processing with the error message "Skipping point as it has no valid vector"
  2. The vector names we specified in our configuration did not match the actual vector names used in the Qdrant collection
  3. We had implemented unnecessary sanitization of model names

Understanding Vector Names in Qdrant

How Qdrant Handles Vector Names

According to the Qdrant documentation, when creating a collection with vectors, you specify vector names and their configurations. These names are used as keys when inserting and querying vectors.

However, when using FastEmbed with Qdrant, we discovered that the model names specified in the configuration are transformed before being used as vector names in the collection:

  • Original model name: "intfloat/multilingual-e5-large"
  • Actual vector name in Qdrant: "fast-multilingual-e5-large"

Similarly for sparse vectors:

  • Original model name: "prithivida/Splade_PP_en_v1"
  • Actual vector name in Qdrant: "fast-sparse-splade_pp_en_v1"

Initial Approach (Problematic)

Our initial approach was to manually transform the model names using a format_vector_name function:

def format_vector_name(name: str) -> str:
    """Format a model name into a valid vector name for Qdrant."""
    return name.replace('/', '_')

This led to inconsistencies because:

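  1. We were using one transformation in our code (replace('/', '_'))
  2. FastEmbed was using a different transformation (dropping the organization prefix before the slash, lowercasing the rest, and prepending "fast-" for dense models or "fast-sparse-" for sparse models)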
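
The mismatch can be reproduced without a running Qdrant instance. The sketch below compares our original format_vector_name with a hypothetical fastembed_style_name helper. That helper is not FastEmbed's code: it is inferred only from the two name pairs listed earlier and may not hold for other models or library versions.

```python
def format_vector_name(name: str) -> str:
    """Our original transformation: replace slashes with underscores."""
    return name.replace("/", "_")


def fastembed_style_name(model_name: str, sparse: bool = False) -> str:
    """Hypothetical helper, inferred from the observed vector names only."""
    prefix = "fast-sparse-" if sparse else "fast-"
    # Drop the organization part before the slash and lowercase the rest
    return prefix + model_name.split("/")[-1].lower()


dense_model = "intfloat/multilingual-e5-large"
sparse_model = "prithivida/Splade_PP_en_v1"

ours = format_vector_name(dense_model)
observed_dense = fastembed_style_name(dense_model)
observed_sparse = fastembed_style_name(sparse_model, sparse=True)

print(ours)             # intfloat_multilingual-e5-large
print(observed_dense)   # fast-multilingual-e5-large
print(observed_sparse)  # fast-sparse-splade_pp_en_v1
print(ours == observed_dense)  # False: a lookup by our name finds no vector
```

Because the two names differ, a check such as `our_name in point.vector` fails for every point, which matches the "no valid vector" skips described in the problem statement.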

Solution: Dynamic Vector Name Discovery

Instead of trying to predict how FastEmbed transforms model names, we implemented a solution that dynamically discovers the actual vector names from the Qdrant collection configuration.

Helper Functions

We added two helper functions to retrieve the actual vector names:

import logging

from qdrant_client import QdrantClient

logger = logging.getLogger(__name__)

def get_dense_vector_name(client: QdrantClient, collection_name: str) -> str:
    """
    Get the name of the dense vector from the collection configuration.
    
    Args:
        client: Initialized Qdrant client
        collection_name: Name of the collection
        
    Returns:
        Name of the dense vector as used in the collection
    """
    try:
        return list(client.get_collection(collection_name).config.params.vectors.keys())[0]
    except (IndexError, AttributeError) as e:
        logger.warning(f"Could not get dense vector name: {e}")
        # Fallback to a default name
        return "fast-multilingual-e5-large"

def get_sparse_vector_name(client: QdrantClient, collection_name: str) -> str:
    """
    Get the name of the sparse vector from the collection configuration.
    
    Args:
        client: Initialized Qdrant client
        collection_name: Name of the collection
        
    Returns:
        Name of the sparse vector as used in the collection
    """
    try:
        return list(client.get_collection(collection_name).config.params.sparse_vectors.keys())[0]
    except (IndexError, AttributeError) as e:
        logger.warning(f"Could not get sparse vector name: {e}")
        # Fallback to a default name
        return "fast-sparse-splade_pp_en_v1"

Implementation in Vector Creation

When creating new points or updating existing ones, we now use these helper functions to get the correct vector names:

# Get vector names from the collection configuration
dense_vector_name = get_dense_vector_name(client, collection_name)
sparse_vector_name = get_sparse_vector_name(client, collection_name)

# Embed once and reuse the result: index 0 is the dense vector, index 1 the sparse vector
embeddings = get_embedding(payload_new['purpose'])

# Create point with the correct vector names
point = PointStruct(
    id=str(uuid.uuid4()),
    vector={
        dense_vector_name: embeddings[0],
        sparse_vector_name: embeddings[1]
    },
    payload={
        # payload fields...
    }
)

Implementation in Vector Querying

Similarly, when querying vectors, we use the same helper functions:

# Get the actual vector names from the collection configuration
dense_vector_name = get_dense_vector_name(client, collection_name)

# Skip points without vector or without the required vector type
if not point.vector or dense_vector_name not in point.vector:
    logger.debug(f"Skipping point {point_id} as it has no valid vector")
    continue
    
# Find semantically similar points using Qdrant's search.
# The named vector is passed as a (vector_name, vector_values) tuple, not as a dict.
similar_points = client.search(
    collection_name=collection_name,
    query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),
    limit=100,
    score_threshold=SIMILARITY_THRESHOLD
)

Key Insights

  1. Model Names vs. Vector Names: There's a distinction between the model names you specify in your configuration and the actual vector names used in the Qdrant collection. FastEmbed transforms these names.

  2. Dynamic Discovery: Instead of hardcoding vector names or trying to predict the transformation, it's better to dynamically discover the actual vector names from the collection configuration.

  3. Fallback Mechanism: Always include fallback mechanisms in case the collection information can't be retrieved, making your code more robust.

  4. Consistency: Use the same vector names throughout your system to ensure consistency between vector creation, storage, and retrieval.

  5. Correct Search Query Format: Named vectors must be passed to client.search in the format the client expects. Instead of a dictionary keyed by vector name, pass a (vector_name, vector_values) tuple as the query_vector parameter. The using parameter belongs to the newer query_points method, not to search.

Accessing Collection Configuration

The key to our solution was discovering how to access the collection configuration to get the actual vector names:

# Get dense vector name
dense_vector_name = list(client.get_collection(collection_name).config.params.vectors.keys())[0]

# Get sparse vector name
sparse_vector_name = list(client.get_collection(collection_name).config.params.sparse_vectors.keys())[0]

This approach allows our code to adapt to however FastEmbed decides to name the vectors in the collection, rather than assuming a specific naming convention.
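
The attribute path can be exercised without a server. The sketch below uses types.SimpleNamespace stand-ins shaped like the object returned by client.get_collection; the list_vector_names helper and the stand-in values are hypothetical. It assumes that a collection created without named vectors exposes something other than a dict under vectors, and that a collection without sparse vectors has sparse_vectors set to None.

```python
from types import SimpleNamespace
from typing import List, Tuple


def list_vector_names(collection_info) -> Tuple[List[str], List[str]]:
    """Return (dense_names, sparse_names) found in a collection info object."""
    params = collection_info.config.params
    # Named dense vectors are stored as a dict; anything else means "unnamed"
    dense = params.vectors if isinstance(params.vectors, dict) else {}
    sparse = getattr(params, "sparse_vectors", None) or {}
    return list(dense.keys()), list(sparse.keys())


# Stand-in for client.get_collection(collection_name); the inner values
# would normally be vector parameter objects and are irrelevant here.
fake_info = SimpleNamespace(
    config=SimpleNamespace(
        params=SimpleNamespace(
            vectors={"fast-multilingual-e5-large": object()},
            sparse_vectors={"fast-sparse-splade_pp_en_v1": object()},
        )
    )
)

dense_names, sparse_names = list_vector_names(fake_info)
print(dense_names)   # ['fast-multilingual-e5-large']
print(sparse_names)  # ['fast-sparse-splade_pp_en_v1']
```

Returning lists rather than the first key makes it visible when a collection holds more than one named vector, a case where indexing with [0] silently picks one of them.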

Correct Search Query Format for Named Vectors

When using named vectors in Qdrant, it's important to use the correct format for search queries. The format depends on the version of the Qdrant client you're using:

Incorrect Format (Causes Validation Error)

# This format causes a validation error
similar_points = client.search(
    collection_name=collection_name,
    query_vector={
        dense_vector_name: point.vector.get(dense_vector_name)
    },
    limit=100
)

Correct Format for Qdrant Client Version 1.12.2

# This is the correct format for Qdrant client version 1.12.2
similar_points = client.search(
    collection_name=collection_name,
    query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),  # Tuple of (vector_name, vector_values)
    limit=100,
    score_threshold=0.8  # Optional similarity threshold
)

In Qdrant client version 1.12.2, the correct way to specify which named vector to use is by providing a tuple to the query_vector parameter. The tuple should contain the vector name as the first element and the actual vector values as the second element.

Using the incorrect format will result in a Pydantic validation error with messages like:

validation errors for SearchRequest
vector.list[float]
  Input should be a valid list [type=list_type, input_value={'fast-multilingual-e5-la...}, input_type=dict]
vector.NamedVector.name
  Field required [type=missing, input_value={'fast-multilingual-e5-la...}, input_type=dict]
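
Code that already builds the dictionary form can be adapted with a small conversion step. The as_named_query helper below is a hypothetical name and the three-element vector is placeholder data; the sketch only shows the reshaping from a single-entry dict to the (vector_name, vector_values) tuple described above.

```python
from typing import Dict, List, Tuple


def as_named_query(named: Dict[str, List[float]]) -> Tuple[str, List[float]]:
    """Turn {"name": [values]} into ("name", [values])."""
    if len(named) != 1:
        raise ValueError(f"expected exactly one named vector, got {len(named)}")
    name, values = next(iter(named.items()))
    return name, list(values)


rejected_form = {"fast-multilingual-e5-large": [0.1, 0.2, 0.3]}
query_vector = as_named_query(rejected_form)
print(query_vector)  # ('fast-multilingual-e5-large', [0.1, 0.2, 0.3])
```

The explicit length check is a design choice: a point that carries both a dense and a sparse vector would otherwise be converted using whichever entry happens to come first.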

Optimizing Search Parameters for Deduplication

When using Qdrant for deduplication of similar content, the search parameters play a crucial role in determining the effectiveness of the process. We've found the following parameters to be particularly important:

Similarity Threshold

The score_threshold parameter determines the minimum similarity score required for points to be considered similar:

similar_points = client.search(
    collection_name=collection_name,
    query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),
    limit=100,
    score_threshold=0.9  # Exclude points whose similarity score is below 0.9
)

For deduplication purposes, we found that a higher threshold (0.9) works better than a lower one (0.7) to avoid false positives. This means that only very similar items will be considered duplicates.
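
To make the effect of the threshold concrete, the sketch below computes cosine similarity by hand and filters candidates the way a score threshold would. It assumes the collection uses cosine distance, which this document does not state; cosine_similarity, filter_by_threshold, and the two-dimensional vectors are illustrative only.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_by_threshold(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    threshold: float,
) -> List[str]:
    """Keep candidate ids whose similarity to the query reaches the threshold."""
    return [
        cid for cid, vec in candidates.items()
        if cosine_similarity(query, vec) >= threshold
    ]


query = [1.0, 0.0]
candidates = {
    "near": [0.95, 0.1],      # nearly the same direction as the query
    "related": [0.7, 0.6],    # related, but clearly different
    "unrelated": [0.0, 1.0],  # orthogonal to the query
}

print(filter_by_threshold(query, candidates, 0.7))  # ['near', 'related']
print(filter_by_threshold(query, candidates, 0.9))  # ['near']
```

At 0.7 the merely related item would be flagged as a potential duplicate; at 0.9 only the near-identical one remains.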

Text Difference Threshold

In addition to vector similarity, we also check the actual text difference between potential duplicates:

# Constants for duplicate detection
SIMILARITY_THRESHOLD = 0.9  # Minimum semantic similarity to consider as potential duplicate
DIFFERENCE_THRESHOLD = 0.05  # Maximum text difference (5%) to consider as duplicate

The DIFFERENCE_THRESHOLD of 0.05 means that texts with less than 5% difference will be considered duplicates. This two-step verification (vector similarity + text difference) helps to ensure that only true duplicates are removed.
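
The two-step check can be sketched as follows. This document does not say how text difference is measured, so the example assumes difflib.SequenceMatcher from the standard library, with difference defined as 1 minus its similarity ratio; text_difference and is_duplicate are hypothetical names.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9   # Minimum vector similarity score
DIFFERENCE_THRESHOLD = 0.05  # Maximum text difference (5%)


def text_difference(a: str, b: str) -> float:
    """Return 0.0 for identical texts, up to 1.0 for texts with nothing in common."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def is_duplicate(score: float, text_a: str, text_b: str) -> bool:
    """Both checks must pass: high vector similarity and low text difference."""
    return (
        score >= SIMILARITY_THRESHOLD
        and text_difference(text_a, text_b) < DIFFERENCE_THRESHOLD
    )


original = "Summarize the following text in three bullet points."
other = "Translate this text to Dutch."

print(is_duplicate(0.97, original, original))  # True
print(is_duplicate(0.97, original, other))     # False: texts differ too much
print(is_duplicate(0.80, original, original))  # False: similarity score too low
```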

Logging Considerations

When working with Qdrant, especially during development and debugging, it's helpful to adjust the logging level:

# Set log level and prevent propagation to the root logger
logger.setLevel(logging.DEBUG)  # For development/debugging; use logging.INFO in production
logger.propagate = False

Using DEBUG level during development provides detailed information about vector operations, including:

  • Which points are being processed
  • Why points are being skipped (e.g., missing vectors)
  • Similarity scores between points
  • Deduplication decisions

However, in production, it's better to use INFO level to reduce log volume, especially when processing large collections.
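
One way to switch between the two levels without editing code is to read the level from the environment. This is a sketch rather than our actual setup: the QDRANT_LOG_LEVEL variable, the configure_logger helper, and the logger name are all hypothetical.

```python
import logging
import os


def configure_logger(name: str, default_level: str = "INFO") -> logging.Logger:
    """Create a logger whose level comes from QDRANT_LOG_LEVEL (hypothetical variable)."""
    level_name = os.environ.get("QDRANT_LOG_LEVEL", default_level).upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # Unknown value: fall back to the production level
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # Keep records out of the root logger's handlers
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


os.environ["QDRANT_LOG_LEVEL"] = "debug"  # e.g. set in a development shell
logger = configure_logger("qdrant_dedup")
logger.debug("Skipping point %s as it has no valid vector", "example-id")
```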

Performance Considerations

Batch Operations

When working with large numbers of points, it's more efficient to use batch operations:

# Batch upsert example
client.upsert(
    collection_name=collection_name,
    points=batch_of_points  # List of PointStruct objects
)

This reduces network overhead compared to upserting points individually.
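
For very large inputs it may be preferable to split the list into fixed-size batches rather than send everything in one call. The sketch below shows one way to do that. The batch size of 100 is an arbitrary choice, not a Qdrant recommendation, and chunked and upsert_in_batches are hypothetical helpers. The upsert step is passed in as a callable so the example runs without a server.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def upsert_in_batches(
    upsert: Callable[[List[T]], None],
    points: Sequence[T],
    batch_size: int = 100,
) -> int:
    """Call upsert once per batch and return the number of calls made."""
    calls = 0
    for batch in chunked(points, batch_size):
        upsert(batch)
        calls += 1
    return calls


# Stand-in for the real call, which would look something like:
#   lambda batch: client.upsert(collection_name=collection_name, points=batch)
sent_batches = []
calls = upsert_in_batches(sent_batches.append, list(range(250)), batch_size=100)
print(calls)                           # 3
print([len(b) for b in sent_batches])  # [100, 100, 50]
```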

Search Limit

The limit parameter in search operations should be set carefully:

similar_points = client.search(
    collection_name=collection_name,
    query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),
    limit=100,  # Maximum number of similar points to return
    score_threshold=0.9
)

A higher limit increases the chance of finding all duplicates but also increases search time. For deduplication purposes, we found that a limit of 100 provides a good balance between thoroughness and performance.
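
When a search returns exactly limit results, more matches may exist beyond the cut-off. One possible way to handle this, which is not something our implementation does, is to retry with a larger limit until the result set is no longer full. In the sketch below, search_until_unsaturated and max_limit are hypothetical, and the search callable stands in for a wrapper around the client.search call shown above.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def search_until_unsaturated(
    search: Callable[[int], List[T]],
    start_limit: int = 100,
    max_limit: int = 1600,
) -> List[T]:
    """Double the limit until fewer than limit results come back or max_limit is reached."""
    limit = start_limit
    while True:
        results = search(limit)
        if len(results) < limit or limit >= max_limit:
            return results
        limit = min(limit * 2, max_limit)


# Fake search: pretends 250 points score above the threshold
requested_limits = []

def fake_search(limit: int) -> List[int]:
    requested_limits.append(limit)
    return list(range(min(limit, 250)))


found = search_until_unsaturated(fake_search)
print(len(found))        # 250
print(requested_limits)  # [100, 200, 400]
```

Each retry repeats the search, so this trades extra queries for completeness; the cap keeps the cost bounded on collections with many near-identical points.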

Conclusion

Our experience with Qdrant has taught us several important lessons:

  1. Dynamic Vector Name Discovery: By retrieving the actual vector names from the Qdrant collection configuration, we've created a robust solution that adapts to the naming conventions used by FastEmbed and Qdrant.

  2. Correct Query Format: Using the proper format for search queries with named vectors is essential - specifically using a tuple of (vector_name, vector_values) for the query_vector parameter.

  3. Optimized Search Parameters: Tuning the similarity and text difference thresholds is crucial for effective deduplication. Strict values (a similarity of at least 0.9 and a text difference below 0.05) gave us better results.

  4. Appropriate Logging Levels: Using DEBUG level during development and INFO in production provides enough detail for troubleshooting while keeping log volume manageable on large collections.

  5. Batch Operations: Using batch operations for inserting and updating points significantly improves performance when working with large collections.

By implementing these lessons, we've created a more efficient and reliable vector search system that properly handles named vectors, effectively identifies duplicates, and maintains good performance even with large collections.

This solution should work regardless of changes to the naming conventions in future versions of Qdrant or FastEmbed, as it reads the actual names directly from the collection configuration.