Collaborator

@JKamsker JKamsker commented Sep 30, 2025

Fixes: #2364 and #2580
Ref: #2646

This pull request introduces a major new capability to LiteDB: Vector Search. This feature enables storing vector embeddings and performing efficient Approximate Nearest Neighbor (ANN) searches, making LiteDB suitable for a wide range of AI and machine learning applications, such as semantic search, recommendation engines, and Retrieval-Augmented Generation (RAG).

Key Features

  • Vector Storage & Indexing:

    • A new BsonVector type (float[]) is introduced for native and efficient storage of vector embeddings.
    • A powerful vector index based on the HNSW (Hierarchical Navigable Small World) algorithm provides fast and scalable similarity searches.
    • Supports common distance metrics:
      • VectorDistanceMetric.Cosine (default)
      • VectorDistanceMetric.Euclidean
      • VectorDistanceMetric.DotProduct
  • New Fluent Query API:

    • The query API has been extended with intuitive methods for vector operations, available via the LiteDB.Vector namespace:
      • .WhereNear(x => x.Embedding, targetVector, maxDistance): Filters documents where the vector is within a specified distance of the target.
      • .TopKNear(x => x.Embedding, targetVector, k): Efficiently retrieves the top K nearest neighbors to a target vector, using the index to avoid full scans and sorting.
  • SQL / BsonExpression Support:

    • A new VECTOR_SIM(field, target_vector) function and infix operator have been added, allowing vector similarity calculations directly within expressions and SQL queries.
    • Example: SELECT * FROM docs WHERE $.Embedding VECTOR_SIM @0 <= 0.25
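For reference, the three metrics and the semantics of the two query methods can be written as a plain full scan. This is a minimal sketch, not LiteDB's implementation: the helper names (`CosineDistance`, `TopKNearScan`, `WhereNearScan`) are hypothetical, and the exact formulas LiteDB uses internally (for example, how a dot product is turned into a distance) are not stated in this PR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reference definitions only; LiteDB's internal formulas may differ.

// Cosine distance: 1 - cos(angle between a and b). 0 means same direction.
// Returns NaN if either vector is all zeros.
double CosineDistance(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += (double)a[i] * b[i];
        normA += (double)a[i] * a[i];
        normB += (double)b[i] * b[i];
    }
    return 1.0 - dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Euclidean (L2) distance: straight-line distance between the two points.
double EuclideanDistance(float[] a, float[] b)
{
    double sum = 0;
    for (int i = 0; i < a.Length; i++)
    {
        double diff = (double)a[i] - b[i];
        sum += diff * diff;
    }
    return Math.Sqrt(sum);
}

// Raw dot product: a similarity (larger means closer), not a distance.
double DotProduct(float[] a, float[] b)
{
    double dot = 0;
    for (int i = 0; i < a.Length; i++) dot += (double)a[i] * b[i];
    return dot;
}

// Full-scan equivalent of TopKNear: sort every document, keep the k closest.
List<int> TopKNearScan(IEnumerable<(int Id, float[] Embedding)> items, float[] target, int k) =>
    items.OrderBy(d => CosineDistance(d.Embedding, target))
         .Take(k)
         .Select(d => d.Id)
         .ToList();

// Full-scan equivalent of WhereNear: keep documents within maxDistance.
List<int> WhereNearScan(IEnumerable<(int Id, float[] Embedding)> items, float[] target, double maxDistance) =>
    items.Where(d => CosineDistance(d.Embedding, target) <= maxDistance)
         .Select(d => d.Id)
         .ToList();

var corpus = new (int Id, float[] Embedding)[]
{
    (1, new[] { 1f, 0f }),
    (2, new[] { 0f, 1f }),
    (3, new[] { 1f, 1f }),
};
var query = new[] { 1f, 0f };

Console.WriteLine(string.Join(", ", TopKNearScan(corpus, query, 2)));     // prints "1, 3"
Console.WriteLine(string.Join(", ", WhereNearScan(corpus, query, 0.25))); // prints "1"
```

A scan like this costs one distance computation per document plus a sort, which is the work the HNSW index is meant to avoid.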

End-to-End Demo Application

To showcase this new feature, a new demo project LiteDB.Demo.Tools.VectorSearch has been added. It's a command-line tool that provides a complete workflow for building a semantic search engine:

  1. Ingest: Reads text documents (.txt, .md), splits them into chunks, and generates embeddings using the Google Gemini API.
  2. Store: Persists the document metadata and vector embeddings in a LiteDB database, creating a vector index on the embeddings.
  3. Search: Takes a user query, generates an embedding, and uses .TopKNear() to find and display the most semantically similar document chunks from the database.
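The chunking part of the ingest step can be sketched as a sliding window over the text. This is an assumption for illustration only: the PR does not say how the demo tool splits documents, and `ChunkText`, `chunkSize`, and `overlap` are hypothetical names. Sizes are in characters here; a real tool might count tokens instead.

```csharp
using System;
using System.Collections.Generic;

// Split text into fixed-size character windows. Consecutive windows share
// `overlap` characters so that text near a boundary appears in both chunks.
List<string> ChunkText(string text, int chunkSize, int overlap)
{
    if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
    if (overlap < 0 || overlap >= chunkSize) throw new ArgumentOutOfRangeException(nameof(overlap));

    var result = new List<string>();
    int step = chunkSize - overlap; // how far the window advances each time
    for (int start = 0; start < text.Length; start += step)
    {
        int length = Math.Min(chunkSize, text.Length - start);
        result.Add(text.Substring(start, length));
        if (start + length >= text.Length) break; // this window reached the end
    }
    return result;
}

var chunks = ChunkText("abcdefghij", chunkSize: 4, overlap: 2);
Console.WriteLine(string.Join(" | ", chunks)); // prints "abcd | cdef | efgh | ghij"
```

Each chunk would then be sent to the embedding API and stored together with its vector.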

Usage Example

The following snippet creates a vector index, inserts a document, and runs a top-K query.

using LiteDB;
using LiteDB.Vector; // Import the new vector extensions

public class Document
{
    public int Id { get; set; }
    public string Text { get; set; }
    public float[] Embedding { get; set; }
}

// 'db' is assumed to be an already-open LiteDatabase instance.
var docs = db.GetCollection<Document>("docs");

// Create a vector index on the 'Embedding' property with 128 dimensions.
// No metric is passed, so the default (Cosine) is used.
docs.EnsureIndex(x => x.Embedding, new VectorIndexOptions(128));

// Insert a document with its vector embedding
docs.Insert(new Document 
{ 
    Id = 1, 
    Text = "This is a sample document.",
    Embedding = new float[] { 0.1f, 0.2f, ... } // "..." elides the remaining values; the index expects 128 in total
});

// Define a target vector for your search query
var queryVector = new float[] { 0.11f, 0.22f, ... }; // same dimension count as the index

// Find the 5 most similar documents
var results = docs.Query()
    .TopKNear(x => x.Embedding, queryVector, k: 5)
    .ToList();

Additional Changes

  • Engine Integration: The core database engine (Insert, Update, Delete, DropCollection, Rebuild) has been updated to fully manage the lifecycle of vector indexes and their associated data pages.
  • Testing & Benchmarks: Added comprehensive unit tests to validate the correctness of the index and distance calculations against reference implementations. Performance benchmarks have also been added to demonstrate the speed advantage of indexed queries over full scans.
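One common way to validate an approximate index against a reference implementation is recall@k: the fraction of the exact top-k neighbours that the approximate query also returned. The sketch below is an assumption about how such a check could look, not the test code in this PR; `ExactTopK` and `RecallAtK` are hypothetical helpers, and the `approx` array stands in for IDs returned by an indexed query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

double Euclidean(float[] a, float[] b)
{
    double sum = 0;
    for (int i = 0; i < a.Length; i++)
    {
        double diff = (double)a[i] - b[i];
        sum += diff * diff;
    }
    return Math.Sqrt(sum);
}

// Ground truth: indexes of the k closest rows, found by scanning everything.
int[] ExactTopK(float[][] data, float[] target, int k) =>
    Enumerable.Range(0, data.Length)
              .OrderBy(i => Euclidean(data[i], target))
              .Take(k)
              .ToArray();

// Fraction of the exact neighbours that the approximate result also contains.
double RecallAtK(IReadOnlyCollection<int> approxIds, IReadOnlyCollection<int> exactIds) =>
    exactIds.Count == 0 ? 1.0 : (double)approxIds.Intersect(exactIds).Count() / exactIds.Count;

var dataset = new[]
{
    new[] { 0f, 0f },
    new[] { 1f, 0f },
    new[] { 0f, 1f },
    new[] { 5f, 5f },
};
var probe = new[] { 0.9f, 0.1f };

int[] exact = ExactTopK(dataset, probe, 2); // rows 1 and 0 are closest
int[] approx = { 1, 2 };                    // placeholder for an indexed query's result
Console.WriteLine(RecallAtK(approx, exact)); // prints 0.5
```

Averaging this value over many random queries gives a single accuracy figure to report next to the latency numbers.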

Benchmarks

[Image: benchmark results]

Credits

Thanks @hurley451 for the initial work

@Copilot Copilot AI review requested due to automatic review settings September 30, 2025 07:52
@JKamsker JKamsker changed the base branch from master to dev September 30, 2025 07:52

@Copilot Copilot AI left a comment


Pull Request Overview

This pull request introduces comprehensive vector similarity search and benchmarking capabilities to LiteDB, implementing vector indexes, similarity queries, and CLI tooling for document ingestion and search using embeddings. The changes significantly expand LiteDB's functionality beyond traditional document storage to include modern vector database features.

  • Adds vector similarity search with support for multiple distance metrics (cosine, Euclidean, dot product)
  • Implements hierarchical navigable small world (HNSW) vector index structures for efficient similarity queries
  • Creates extensible CLI tools for document ingestion, chunking, embedding generation, and vector search operations
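As background on the "multi-level" structure mentioned above: in the standard HNSW formulation, each inserted node is assigned a random top layer drawn from an exponentially decaying distribution, so most nodes live only in layer 0 and few reach the sparse upper layers. The sketch below shows that textbook rule. Whether LiteDB's `VectorIndexService` uses exactly this formula or these parameters is an assumption; `RandomLevel` and `m` are illustrative names.

```csharp
using System;
using System.Linq;

// Textbook HNSW layer assignment: level = floor(-ln(U) * mL),
// with U uniform in (0, 1] and mL = 1 / ln(M), where M is the
// maximum number of neighbours per node.
int RandomLevel(Random rng, int m)
{
    double mL = 1.0 / Math.Log(m);
    double u = 1.0 - rng.NextDouble(); // NextDouble() is in [0, 1), so u is in (0, 1]
    return (int)Math.Floor(-Math.Log(u) * mL);
}

var random = new Random(42);
var histogram = new int[8]; // the last bucket collects level 7 and above
for (int i = 0; i < 10_000; i++)
{
    int level = Math.Min(RandomLevel(random, m: 16), histogram.Length - 1);
    histogram[level]++;
}

// With M = 16, roughly 15 out of 16 nodes are expected to land on level 0.
Console.WriteLine(string.Join(", ", histogram));
```

A search starts at the top layer, greedily moves toward the target, and descends layer by layer, which is what lets it skip most of the collection.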

Reviewed Changes

Copilot reviewed 69 out of 70 changed files in this pull request and generated 4 comments.

File | Description
LiteDB/Engine/Structures/VectorIndexNode.cs | Core vector index node implementation with multi-level neighbor management
LiteDB/Engine/Services/VectorIndexService.cs | Vector similarity search service with HNSW algorithm implementation
LiteDB/Document/BsonVector.cs | New BsonVector type for native vector data storage
LiteDB/Client/Vector/* | Client-side vector extensions for collections, queries, and repositories
LiteDB/Engine/Query/IndexQuery/VectorIndexQuery.cs | Query execution engine for vector similarity searches
tests.runsettings & tests.ci.runsettings | Test configuration files with timeout and CI optimization settings


@JKamsker JKamsker changed the title from "[Feature] Vector Search support" to "feat: Vector Search and Similarity Indexing" Sep 30, 2025
@JKamsker JKamsker merged commit 66ec615 into dev Sep 30, 2025
1 check passed
Development

Successfully merging this pull request may close these issues.

Add Vector functionality for LLM embedding