@JKamsker JKamsker commented Sep 24, 2025

Fixes: #2364 and #2580

This pull request introduces vector similarity search and benchmarking capabilities to the LiteDB project, along with new CLI tooling for document ingestion and search using embeddings. The main changes include adding vector fields to benchmark models, implementing benchmarks for vector similarity queries, and creating extensible CLI commands for ingesting and searching documents with vector embeddings. The changes are grouped below by theme.

Vector Similarity Benchmarking:

  • Added a new benchmark class QueryWithVectorSimilarity to LiteDB.Benchmarks/Benchmarks/Queries/QueryWithVectorSimilarity.cs, enabling performance testing of vector similarity queries (both indexed and unindexed) using the new Vectors field.
  • Updated the FileMetaBase model to include a float[] Vectors property, supporting storage of vector embeddings for similarity search.
  • Modified the FileMetaGenerator to generate random 128-dimensional vectors for each document, ensuring benchmark data includes vector embeddings. [1] [2]
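
A minimal sketch of how random 128-dimensional vectors could be produced for benchmark data. The class name `RandomVectors`, the value range, and the use of a seeded `System.Random` are assumptions for illustration; the actual FileMetaGenerator code is not shown in this conversation.

```csharp
using System;

// Hypothetical helper, not the PR's FileMetaGenerator.
public static class RandomVectors
{
    // Fills a new array with values in roughly [-1, 1].
    // Passing a seeded Random keeps benchmark data reproducible between runs.
    public static float[] Next(Random random, int dimensions = 128)
    {
        if (random == null) throw new ArgumentNullException(nameof(random));
        if (dimensions <= 0) throw new ArgumentOutOfRangeException(nameof(dimensions));

        var vector = new float[dimensions];
        for (var i = 0; i < vector.Length; i++)
        {
            vector[i] = (float)(random.NextDouble() * 2.0 - 1.0);
        }
        return vector;
    }
}
```

For example, `RandomVectors.Next(new Random(42))` returns a 128-element array, and the same seed yields the same values on the same runtime.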

Vector Search CLI Tooling:

  • Implemented the IngestCommand in LiteDB.Demo.Tools.VectorSearch/Commands/IngestCommand.cs, which ingests documents from a directory, splits them into chunks, generates embeddings, and stores them in a database with chunk-level vector indexing. Includes options for skipping unchanged files, pruning missing files, and progress reporting.
  • Added the SearchCommand in LiteDB.Demo.Tools.VectorSearch/Commands/SearchCommand.cs, allowing users to search for documents via semantic similarity using embeddings, with support for top-K results, max distance filtering, and customizable output.
  • Created a shared base class VectorSearchCommandSettings to handle configuration, authentication, and validation for embedding services and database paths in CLI commands.
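
The chunking step of ingestion can be pictured roughly as follows. This is only a sketch under assumptions: `TextChunker` is a hypothetical name, and splitting on a fixed character count is the simplest possible strategy. The PR description does not say how IngestCommand actually splits documents.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not the PR's IngestCommand logic.
public static class TextChunker
{
    // Splits text into consecutive chunks of at most maxChunkLength characters.
    public static List<string> Split(string text, int maxChunkLength)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (maxChunkLength <= 0) throw new ArgumentOutOfRangeException(nameof(maxChunkLength));

        var chunks = new List<string>();
        for (var start = 0; start < text.Length; start += maxChunkLength)
        {
            // The last chunk may be shorter than maxChunkLength.
            var length = Math.Min(maxChunkLength, text.Length - start);
            chunks.Add(text.Substring(start, length));
        }
        return chunks;
    }
}
```

Each chunk would then be sent to the embedding service and stored with its own vector, which is what makes chunk-level indexing possible.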

Benchmarks

[Image: benchmark results]

Credits

Thanks @hurley451 for the initial work

…im-function

Support vector similarity operator in predicates
Fix vector serialization and comparison
Fix vector serialization and comparison
…t-in-jsonwriter

Handle BsonVector in JSON writer
…ngestcommand

Fix vector index mapping for enumerable expressions
…and-chunk-processing

; Conflicts:
;	.github/workflows/ci.yml
;	LiteDB.sln
;	LiteDB/Engine/Query/QueryOptimization.cs
Fix vector order lookup with composite ordering
…search-feature

Improve vector index tests with MathNet comparisons
…odel-and-chunk-processing

Introduce chunk-based vector search indexing
…nn-design

Implement ANN graph for vector index
…port

Integrate vector indexes into planner and tests
@JKamsker JKamsker requested a review from Copilot September 24, 2025 15:14
@JKamsker
Collaborator Author

@codex review

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces comprehensive vector similarity search capabilities to LiteDB, implementing vector indexing, storage, and query functionality. The implementation includes a new HNSW-based vector index, BsonVector type, vector similarity operations, and supporting CLI tools for document ingestion and search using embeddings.

  • Adds new BsonVector type with VECTOR_SIM expression support for cosine distance calculations
  • Implements HNSW-based vector indexing with cosine, Euclidean, and dot product distance metrics
  • Creates comprehensive vector search API with WhereNear and TopKNear query methods
  • Includes CLI demo tools for embedding text documents using Google Gemini and performing semantic search
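
For reference, the three distance metrics named above can be written out in plain C#, together with an exhaustive top-K scan of the kind an unindexed query has to perform. This is an illustrative sketch only: `VectorMath` and its members are hypothetical names, it does not use LiteDB's WhereNear/TopKNear API (whose signatures are not shown here), and treating a zero-length vector as orthogonal is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, independent of LiteDB.
public static class VectorMath
{
    public static double Dot(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
        double sum = 0;
        for (var i = 0; i < a.Length; i++) sum += (double)a[i] * b[i];
        return sum;
    }

    public static double Euclidean(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
        double sum = 0;
        for (var i = 0; i < a.Length; i++)
        {
            var d = (double)a[i] - b[i];
            sum += d * d;
        }
        return Math.Sqrt(sum);
    }

    // Cosine distance = 1 - cosine similarity:
    // 0 for the same direction, 1 for orthogonal, 2 for opposite vectors.
    public static double CosineDistance(float[] a, float[] b)
    {
        var norms = Math.Sqrt(Dot(a, a)) * Math.Sqrt(Dot(b, b));
        if (norms == 0) return 1.0; // assumption: zero vectors count as orthogonal
        return 1.0 - Dot(a, b) / norms;
    }

    // Brute-force search: returns the positions of the k vectors closest to target.
    public static List<int> TopK(IReadOnlyList<float[]> vectors, float[] target, int k)
    {
        return Enumerable.Range(0, vectors.Count)
            .OrderBy(i => CosineDistance(vectors[i], target))
            .Take(k)
            .ToList();
    }
}
```

The scan computes one distance per stored vector, so its cost grows linearly with the collection. An HNSW index avoids that by visiting only a small part of a neighbor graph, at the price of approximate results.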

Reviewed Changes

Copilot reviewed 65 out of 66 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| LiteDB/Document/BsonVector.cs | New BsonVector type for storing float arrays as BSON values |
| LiteDB/Engine/Services/VectorIndexService.cs | Core HNSW vector index implementation with search, insert, and delete operations |
| LiteDB/Engine/Structures/VectorIndexNode.cs | Vector index node structure with multilevel neighbor management |
| LiteDB/Client/Database/LiteQueryable.cs | Vector query API with WhereNear and TopKNear methods |
| LiteDB.Demo.Tools.VectorSearch/ | Complete CLI demo for document embedding and semantic search |

Comments suppressed due to low confidence (1)

LiteDB/Engine/Pages/BasePage.cs:1

  • The error message contains grammatical errors. Should be: 'only data/index/vector index pages can be deleted'
using System;

## License

[MIT](http://opensource.org/licenses/MIT)
</a>

Copilot AI Sep 24, 2025

[nitpick] The license section was removed from the README. Consider whether this was intentional or if the MIT license information should be preserved for clarity.

Suggested change
</a>
</a>
## License
LiteDB is licensed under the [MIT License](LICENSE).
Copyright (c) 2014-2024 Mauricio David

Copilot uses AI. Check for mistakes.

{
sb.AppendLine($"WHERE {string.Join(" AND ", this.Where.Select(x => x.Source))}");
}

Copilot AI Sep 24, 2025

The WHERE clause logic was removed but only replaced with an empty line. This appears to be incomplete - the WHERE clause handling should be restored or properly integrated with the vector search logic.

@JKamsker
Collaborator Author

It was moved further down

firstBlock = block.Position;
}

Buffer.BlockCopy(vector, bytesWritten, block.Buffer.Array, block.Buffer.Offset, chunk);

Copilot AI Sep 24, 2025

Buffer.BlockCopy expects byte arrays but 'vector' is a float array. This should use Buffer.BlockCopy(vector, bytesWritten * sizeof(float), block.Buffer.Array, block.Buffer.Offset, chunk) or convert the float array to bytes first.

Suggested change
Buffer.BlockCopy(vector, bytesWritten, block.Buffer.Array, block.Buffer.Offset, chunk);
Buffer.BlockCopy(vector, bytesWritten / sizeof(float), block.Buffer.Array, block.Buffer.Offset, chunk);
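
For context when weighing this suggestion: `System.Buffer.BlockCopy` is documented to accept arrays of any primitive type, including `float[]`, and to interpret both offsets and the count in bytes rather than in elements. The sketch below copies a float array into fixed-size byte blocks and back. `VectorBytes`, `WriteInBlocks`, and `ReadFromBlocks` are hypothetical names, and the blocks are plain `byte[]` arrays rather than LiteDB's data blocks. If `bytesWritten` and `chunk` in the reviewed code are byte counts, as their names suggest, the original call follows the same pattern.

```csharp
using System;

// Hypothetical helper, not the PR's data-block writer.
public static class VectorBytes
{
    // Splits the raw bytes of a float[] across blocks of at most blockSize bytes.
    public static byte[][] WriteInBlocks(float[] vector, int blockSize)
    {
        if (blockSize <= 0) throw new ArgumentOutOfRangeException(nameof(blockSize));

        var totalBytes = Buffer.ByteLength(vector); // vector.Length * sizeof(float)
        var blockCount = (totalBytes + blockSize - 1) / blockSize;
        var blocks = new byte[blockCount][];
        var bytesWritten = 0;

        for (var i = 0; i < blockCount; i++)
        {
            var chunk = Math.Min(blockSize, totalBytes - bytesWritten);
            blocks[i] = new byte[chunk];
            // Source offset and count are byte values, even though the source is a float[].
            Buffer.BlockCopy(vector, bytesWritten, blocks[i], 0, chunk);
            bytesWritten += chunk;
        }
        return blocks;
    }

    // Reassembles the float[] by copying each block to the matching byte offset.
    public static float[] ReadFromBlocks(byte[][] blocks, int dimensions)
    {
        var vector = new float[dimensions];
        var bytesCopied = 0;
        foreach (var block in blocks)
        {
            Buffer.BlockCopy(block, 0, vector, bytesCopied, block.Length);
            bytesCopied += block.Length;
        }
        return vector;
    }
}
```

A block size that is not a multiple of `sizeof(float)` still round-trips correctly, because the copy works on raw bytes and never needs to land on an element boundary.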

throw new LiteException(0, "Vector data block is corrupted.");
}

Buffer.BlockCopy(slice.Array, slice.Offset, vector, bytesCopied, available);

Copilot AI Sep 24, 2025

Buffer.BlockCopy is copying bytes to a float array. The destination should be byte array or use unsafe code to properly handle the float array memory layout.

case BsonType.MinValue: return BsonValue.MinValue;
case BsonType.MaxValue: return BsonValue.MaxValue;
case BsonType.Vector: return buffer.ReadVector(offset);

Copilot AI Sep 24, 2025

The ReadVector method returns float[] but this method returns BsonValue. The return should be wrapped: return new BsonVector(buffer.ReadVector(offset));

Suggested change
case BsonType.Vector: return buffer.ReadVector(offset);
case BsonType.Vector: return new BsonVector(buffer.ReadVector(offset));

{
page = this.GetPage<VectorIndexPage>(freeVectorPageList);

ENSURE(page.FreeBytes > bytesLength, "this page shout be space enouth for this new vector node");

Copilot AI Sep 24, 2025

Multiple spelling errors in the error message: 'shout' should be 'should' and 'enouth' should be 'enough'

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Codex Review: Here are some suggestions.

foreach(var index in _collectionPage.GetCollectionIndexes())
{
// add head/tail (same page) to be deleted
indexPages.Add(index.Head.PageID);
foreach (var node in indexer.FindAll(index, Query.Ascending))
{
indexPages.Add(node.Page.PageID);

[P1] Skip-list cleanup assumes vector indexes use IndexPage layout

When dropping a collection the engine iterates every entry returned by GetCollectionIndexes() and treats them as skip-list indexes by reading index.Head and calling indexer.FindAll(index, …) to collect IndexPage IDs. Vector indexes stored in the same dictionary have IndexType == 1 and never populate Head/Tail; their metadata lives in VectorIndexMetadata.Root and nodes are backed by VectorIndexPage. Passing a vector index through this block either throws when index.Head.PageID is uint.MaxValue or simply skips releasing vector pages, leaving allocated VectorIndexPage/data blocks behind. The loop needs to exclude vector indexes or invoke a dedicated release routine that walks VectorIndexMetadata.Root instead of the skip-list path.
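
The shape of the suggested fix can be sketched with simplified stand-in types. Everything below is hypothetical: `IndexEntry`, `VectorNode`, and `DropCollectionSketch` are not LiteDB engine types, the skip-list pages are given as a ready-made list instead of being found via `indexer.FindAll`, and the sketch assumes every vector node is reachable from the root through neighbor links.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a vector index node backed by a VectorIndexPage.
public sealed class VectorNode
{
    public uint PageId;
    public List<VectorNode> Neighbors = new List<VectorNode>();
}

// Hypothetical stand-in for one entry returned by GetCollectionIndexes().
public sealed class IndexEntry
{
    public bool IsVector;
    public List<uint> SkipListPageIds = new List<uint>(); // head/tail plus node pages
    public VectorNode VectorRoot;                         // plays the role of VectorIndexMetadata.Root
}

public static class DropCollectionSketch
{
    // Collects every page to release, choosing the walk that matches the index type.
    public static HashSet<uint> CollectPagesToRelease(IEnumerable<IndexEntry> indexes)
    {
        var pages = new HashSet<uint>();
        foreach (var index in indexes)
        {
            if (index.IsVector)
            {
                CollectVectorPages(index.VectorRoot, pages);
            }
            else
            {
                pages.UnionWith(index.SkipListPageIds);
            }
        }
        return pages;
    }

    // Breadth-first walk over the neighbor graph; the visited set guards against cycles.
    private static void CollectVectorPages(VectorNode root, HashSet<uint> pages)
    {
        if (root == null) return;

        var visited = new HashSet<VectorNode> { root };
        var queue = new Queue<VectorNode>();
        queue.Enqueue(root);

        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            pages.Add(node.PageId);
            foreach (var neighbor in node.Neighbors)
            {
                if (visited.Add(neighbor)) queue.Enqueue(neighbor);
            }
        }
    }
}
```

The point of the branch is that a vector index never goes through the skip-list path, so an unpopulated Head/Tail is never read and the vector pages are still collected for release.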


@JKamsker JKamsker changed the title Full vector support Vector Search support Sep 27, 2025
@syan2018

Hello! Will these new features be released in LiteDB 6.0? And can I try them in the NuGet 6.0.0-prerelease package?

@JKamsker
Collaborator Author

JKamsker commented Sep 30, 2025

Hello! Will these new features be released in LiteDB 6.0? And can I try them in the NuGet 6.0.0-prerelease package?

@syan2018 Will be on nuget in 15-30 minutes 👍

@JKamsker
Collaborator Author

@syan2018 https://www.nuget.org/packages/LiteDB/6.0.0-prerelease.52
