feat: Vector Search and Similarity Indexing #2678

JKamsker · 2025-09-30T07:52:00Z

This pull request introduces a major new capability to LiteDB: Vector Search. This feature enables storing vector embeddings and performing efficient Approximate Nearest Neighbor (ANN) searches, making LiteDB suitable for a wide range of AI and machine learning applications, such as semantic search, recommendation engines, and Retrieval-Augmented Generation (RAG).

Key Features

Vector Storage & Indexing:
- A new BsonVector type (float[]) is introduced for native and efficient storage of vector embeddings.
- A powerful vector index based on the HNSW (Hierarchical Navigable Small World) algorithm provides fast and scalable similarity searches.
- Supports common distance metrics:
  - VectorDistanceMetric.Cosine (default)
  - VectorDistanceMetric.Euclidean
  - VectorDistanceMetric.DotProduct
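For readers unfamiliar with the three metrics, the following is a minimal sketch of how each is commonly computed. The helper names (`Dot`, `EuclideanDistance`, `CosineDistance`) are hypothetical and are not taken from this PR. How LiteDB turns a dot product into a distance, and how it treats zero-length vectors, is not stated in the description, so both are assumptions here.

```csharp
using System;

// Hypothetical helpers for illustration; not LiteDB's internal implementation.
static double Dot(float[] a, float[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
    double sum = 0;
    for (int i = 0; i < a.Length; i++) sum += (double)a[i] * b[i];
    return sum;
}

// Straight-line (L2) distance between two points.
static double EuclideanDistance(float[] a, float[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
    double sum = 0;
    for (int i = 0; i < a.Length; i++)
    {
        double d = (double)a[i] - b[i];
        sum += d * d;
    }
    return Math.Sqrt(sum);
}

// Cosine distance, taken here as 1 - cosine similarity.
static double CosineDistance(float[] a, float[] b)
{
    double normA = Math.Sqrt(Dot(a, a));
    double normB = Math.Sqrt(Dot(b, b));
    // Assumption: a zero vector is treated as maximally dissimilar.
    if (normA == 0 || normB == 0) return 1.0;
    return 1.0 - Dot(a, b) / (normA * normB);
}

var x = new float[] { 1f, 0f };
var y = new float[] { 0f, 1f };
Console.WriteLine(CosineDistance(x, y));
Console.WriteLine(EuclideanDistance(x, y));
Console.WriteLine(Dot(x, y));
```

Cosine distance ignores vector length and compares direction only, which is one reason it is a common default for text embeddings.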
New Fluent Query API:
- The query API has been extended with intuitive methods for vector operations, available via the LiteDB.Vector namespace:
  - .WhereNear(x => x.Embedding, targetVector, maxDistance): Filters documents where the vector is within a specified distance of the target.
  - .TopKNear(x => x.Embedding, targetVector, k): Efficiently retrieves the top K nearest neighbors to a target vector, using the index to avoid full scans and sorting.
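To make the semantics of the two methods concrete, here is a brute-force sketch of what a distance-threshold filter and a top-K query return when evaluated by full scan. It is not the HNSW index from this PR; `WhereNearScan` and `TopKNearScan` are hypothetical names, and cosine distance is assumed. An approximate index such as HNSW is designed to return a similar result without visiting every document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Cosine distance = 1 - cosine similarity (assumed convention for this sketch).
static double CosineDistance(float[] a, float[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += (double)a[i] * b[i];
        na += (double)a[i] * a[i];
        nb += (double)b[i] * b[i];
    }
    return (na == 0 || nb == 0) ? 1.0 : 1.0 - dot / Math.Sqrt(na * nb);
}

// Full-scan equivalent of a threshold filter: keep documents within maxDistance.
static List<(int Id, double Distance)> WhereNearScan(
    IEnumerable<(int Id, float[] Embedding)> docs, float[] target, double maxDistance) =>
    docs.Select(d => (d.Id, Distance: CosineDistance(d.Embedding, target)))
        .Where(r => r.Distance <= maxDistance)
        .ToList();

// Full-scan equivalent of a top-K query: sort by distance, keep the first k.
static List<(int Id, double Distance)> TopKNearScan(
    IEnumerable<(int Id, float[] Embedding)> docs, float[] target, int k) =>
    docs.Select(d => (d.Id, Distance: CosineDistance(d.Embedding, target)))
        .OrderBy(r => r.Distance)
        .Take(k)
        .ToList();

var corpus = new List<(int Id, float[] Embedding)>
{
    (1, new float[] { 1f, 0f }),
    (2, new float[] { 0.9f, 0.1f }),
    (3, new float[] { 0f, 1f }),
};
var query = new float[] { 1f, 0f };

foreach (var hit in TopKNearScan(corpus, query, k: 2))
    Console.WriteLine($"{hit.Id}: {hit.Distance:F4}");
```

The scan costs one distance computation per document plus a sort, which is the work the description says `.TopKNear` avoids by using the index.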
SQL / BsonExpression Support:
- A new VECTOR_SIM(field, target_vector) function and infix operator have been added, allowing vector similarity calculations directly within expressions and SQL queries.
- Example: SELECT * FROM docs WHERE $.Embedding VECTOR_SIM @0 <= 0.25

End-to-End Demo Application

To showcase this new feature, a new demo project LiteDB.Demo.Tools.VectorSearch has been added. It's a command-line tool that provides a complete workflow for building a semantic search engine:

- Ingest: Reads text documents (.txt, .md), splits them into chunks, and generates embeddings using the Google Gemini API.
- Store: Persists the document metadata and vector embeddings in a LiteDB database, creating a vector index on the embeddings.
- Search: Takes a user query, generates an embedding, and uses .TopKNear() to find and display the most semantically similar document chunks from the database.
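The description does not say how the demo splits documents into chunks. As one possible approach, the sketch below uses fixed-size character windows with a small overlap so that text cut at a boundary still appears whole in one chunk. `ChunkText` and its parameters are hypothetical and are not part of the demo project.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical chunker: fixed-size character windows with overlap.
// The demo tool's actual chunking strategy may differ.
static List<string> ChunkText(string text, int chunkSize, int overlap)
{
    if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
    if (overlap < 0 || overlap >= chunkSize) throw new ArgumentOutOfRangeException(nameof(overlap));

    var chunks = new List<string>();
    int step = chunkSize - overlap; // how far each window advances
    for (int start = 0; start < text.Length; start += step)
    {
        int length = Math.Min(chunkSize, text.Length - start);
        chunks.Add(text.Substring(start, length));
        if (start + length >= text.Length) break; // this window reached the end
    }
    return chunks;
}

foreach (var chunk in ChunkText("abcdefghij", chunkSize: 4, overlap: 1))
    Console.WriteLine(chunk);
```

Each chunk would then be sent to the embedding service and stored alongside its vector, as in the Store step above.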

Usage Example

Getting started with vector search is straightforward.

using LiteDB;
using LiteDB.Vector; // Import the new vector extensions

public class Document
{
    public int Id { get; set; }
    public string Text { get; set; }
    public float[] Embedding { get; set; }
}

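// Open (or create) a database file; the path is a placeholder.
using var db = new LiteDatabase("vectors.db");
var docs = db.GetCollection<Document>("docs");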

// Create a vector index on the 'Embedding' property for 128-dimensional vectors.
// No metric is specified here, so the default (Cosine) applies.
docs.EnsureIndex(x => x.Embedding, new VectorIndexOptions(128));

// Insert a document with its vector embedding
docs.Insert(new Document 
{ 
    Id = 1, 
    Text = "This is a sample document.",
    Embedding = new float[] { 0.1f, 0.2f, /* ... remaining values elided; 128 in total */ }
});

// Define a target vector for your search query
var queryVector = new float[] { 0.11f, 0.22f, /* ... remaining values elided; 128 in total */ };

// Find the 5 most similar documents
var results = docs.Query()
    .TopKNear(x => x.Embedding, queryVector, k: 5)
    .ToList();

Additional Changes

- Engine Integration: The core database engine (Insert, Update, Delete, DropCollection, Rebuild) has been updated to fully manage the lifecycle of vector indexes and their associated data pages.
- Testing & Benchmarks: Added comprehensive unit tests to validate the correctness of the index and distance calculations against reference implementations. Performance benchmarks have also been added to demonstrate the speed advantage of indexed queries over full scans.

Benchmarks

Credits

Thanks @hurley451 for the initial work

Copilot

Pull Request Overview

This pull request introduces comprehensive vector similarity search and benchmarking capabilities to LiteDB, implementing vector indexes, similarity queries, and CLI tooling for document ingestion and search using embeddings. The changes significantly expand LiteDB's functionality beyond traditional document storage to include modern vector database features.

- Adds vector similarity search with support for multiple distance metrics (cosine, Euclidean, dot product)
- Implements hierarchical navigable small world (HNSW) vector index structures for efficient similarity queries
- Creates extensible CLI tools for document ingestion, chunking, embedding generation, and vector search operations

Reviewed Changes

Copilot reviewed 69 out of 70 changed files in this pull request and generated 4 comments.

File	Description
LiteDB/Engine/Structures/VectorIndexNode.cs	Core vector index node implementation with multi-level neighbor management
LiteDB/Engine/Services/VectorIndexService.cs	Vector similarity search service with HNSW algorithm implementation
LiteDB/Document/BsonVector.cs	New BsonVector type for native vector data storage
LiteDB/Client/Vector/*	Client-side vector extensions for collections, queries, and repositories
LiteDB/Engine/Query/IndexQuery/VectorIndexQuery.cs	Query execution engine for vector similarity searches
tests.runsettings & tests.ci.runsettings	Test configuration files with timeout and CI optimization settings

Implement vectors

3284c15

Copilot AI review requested due to automatic review settings September 30, 2025 07:52

JKamsker changed the base branch from master to dev September 30, 2025 07:52

Copilot AI reviewed Sep 30, 2025

JKamsker changed the title ~~[Feature] Vector Search support~~ feat: Vector Search and Similarity Indexing Sep 30, 2025

JKamsker merged commit 66ec615 into dev Sep 30, 2025
1 check passed
