Vector Search support #2646
…im-function Support vector similarity operator in predicates
Fix vector serialization and comparison
Fix vector serialization and comparison
…t-in-jsonwriter Handle BsonVector in JSON writer
…ngestcommand Fix vector index mapping for enumerable expressions
…and-chunk-processing ; Conflicts: ; .github/workflows/ci.yml ; LiteDB.sln ; LiteDB/Engine/Query/QueryOptimization.cs
Fix vector order lookup with composite ordering
…search-feature Improve vector index tests with MathNet comparisons
…odel-and-chunk-processing Introduce chunk-based vector search indexing
…nn-design Implement ANN graph for vector index
…port Integrate vector indexes into planner and tests
@codex review
Pull Request Overview
This PR introduces comprehensive vector similarity search capabilities to LiteDB, implementing vector indexing, storage, and query functionality. The implementation includes a new HNSW-based vector index, BsonVector type, vector similarity operations, and supporting CLI tools for document ingestion and search using embeddings.
- Adds new BsonVector type with VECTOR_SIM expression support for cosine distance calculations
- Implements HNSW-based vector indexing with cosine, Euclidean, and dot product distance metrics
- Creates comprehensive vector search API with WhereNear and TopKNear query methods
- Includes CLI demo tools for embedding text documents using Google Gemini and performing semantic search
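For readers unfamiliar with the three metrics named above, here is a minimal sketch of how they are commonly defined. `VectorDistance` and its method names are hypothetical helpers for illustration, not the PR's implementation, and the handling of zero-length vectors (returning NaN) is an assumption of this sketch.

```csharp
using System;

// Hypothetical helper for illustration only; not LiteDB's implementation.
public static class VectorDistance
{
    // Dot product of two equal-length vectors, accumulated in double precision.
    public static double Dot(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += (double)a[i] * b[i];
        return sum;
    }

    // Euclidean (L2) distance: square root of the summed squared differences.
    public static double Euclidean(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = (double)a[i] - b[i];
            sum += d * d;
        }
        return Math.Sqrt(sum);
    }

    // Cosine distance = 1 - cosine similarity.
    // 0 means same direction, 1 orthogonal, 2 opposite.
    // Assumption: undefined for zero vectors, so this sketch returns NaN.
    public static double Cosine(float[] a, float[] b)
    {
        double normA = Math.Sqrt(Dot(a, a));
        double normB = Math.Sqrt(Dot(b, b));
        if (normA == 0 || normB == 0) return double.NaN;
        return 1.0 - Dot(a, b) / (normA * normB);
    }
}
```

Note that a larger dot product means more similar, while for the two distances smaller means closer, so an index that supports all three has to order candidates accordingly.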
Reviewed Changes
Copilot reviewed 65 out of 66 changed files in this pull request and generated 6 comments.
File | Description |
---|---|
LiteDB/Document/BsonVector.cs | New BsonVector type for storing float arrays as BSON values |
LiteDB/Engine/Services/VectorIndexService.cs | Core HNSW vector index implementation with search, insert, and delete operations |
LiteDB/Engine/Structures/VectorIndexNode.cs | Vector index node structure with multilevel neighbor management |
LiteDB/Client/Database/LiteQueryable.cs | Vector query API with WhereNear and TopKNear methods |
LiteDB.Demo.Tools.VectorSearch/ | Complete CLI demo for document embedding and semantic search |
Comments suppressed due to low confidence (1)
LiteDB/Engine/Pages/BasePage.cs:1
- The error message contains grammatical errors. Should be: 'only data/index/vector index pages can be deleted'
using System;
    ## License

    [MIT](http://opensource.org/licenses/MIT)
    </a>
[nitpick] The license section was removed from the README. Consider whether this was intentional or if the MIT license information should be preserved for clarity.
Suggested change:
    - </a>
    + </a>
    + ## License
    + LiteDB is licensed under the [MIT License](LICENSE).
    + Copyright (c) 2014-2024 Mauricio David
Copilot uses AI. Check for mistakes.
    {
        sb.AppendLine($"WHERE {string.Join(" AND ", this.Where.Select(x => x.Source))}");
    }

The WHERE clause logic was removed but only replaced with an empty line. This appears to be incomplete - the WHERE clause handling should be restored or properly integrated with the vector search logic.
It was moved further down
        firstBlock = block.Position;
    }

    Buffer.BlockCopy(vector, bytesWritten, block.Buffer.Array, block.Buffer.Offset, chunk);
Buffer.BlockCopy expects byte arrays but 'vector' is a float array. This should use Buffer.BlockCopy(vector, bytesWritten * sizeof(float), block.Buffer.Array, block.Buffer.Offset, chunk) or convert the float array to bytes first.
Suggested change:
    - Buffer.BlockCopy(vector, bytesWritten, block.Buffer.Array, block.Buffer.Offset, chunk);
    + Buffer.BlockCopy(vector, bytesWritten / sizeof(float), block.Buffer.Array, block.Buffer.Offset, chunk);
        throw new LiteException(0, "Vector data block is corrupted.");
    }

    Buffer.BlockCopy(slice.Array, slice.Offset, vector, bytesCopied, available);
Buffer.BlockCopy is copying bytes to a float array. The destination should be byte array or use unsafe code to properly handle the float array memory layout.
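For context on this comment and the earlier one about the write path: in .NET, `Buffer.BlockCopy` accepts arrays of any primitive type, and its source offset, destination offset, and count are all measured in bytes. Whether the flagged calls are correct therefore depends on whether `bytesWritten` and `bytesCopied` hold byte counts, as their names suggest; that has not been verified against the PR code. The sketch below is a standalone illustration, with `BlockCopyDemo`, `RoundTrip`, and `chunkSize` as hypothetical names, copying a `float[]` into a `byte[]` and back in chunks.

```csharp
using System;

// Hypothetical demo, not code from the PR.
public static class BlockCopyDemo
{
    // Copies a float[] into a byte[] and back again in chunks of at most chunkSize bytes.
    public static float[] RoundTrip(float[] vector, int chunkSize)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        int totalBytes = vector.Length * sizeof(float);
        var storage = new byte[totalBytes];

        // Write path: the source is a float[], but the source offset is a BYTE offset.
        int bytesWritten = 0;
        while (bytesWritten < totalBytes)
        {
            int chunk = Math.Min(chunkSize, totalBytes - bytesWritten);
            Buffer.BlockCopy(vector, bytesWritten, storage, bytesWritten, chunk);
            bytesWritten += chunk;
        }

        // Read path: the destination is a float[], and the destination offset is again in bytes.
        var result = new float[vector.Length];
        int bytesCopied = 0;
        while (bytesCopied < totalBytes)
        {
            int available = Math.Min(chunkSize, totalBytes - bytesCopied);
            Buffer.BlockCopy(storage, bytesCopied, result, bytesCopied, available);
            bytesCopied += available;
        }

        return result;
    }
}
```

Because the copy is byte-wise, the chunk size does not need to be a multiple of `sizeof(float)`; a value split across two chunks is reassembled correctly as long as the byte offsets line up.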

    case BsonType.MinValue: return BsonValue.MinValue;
    case BsonType.MaxValue: return BsonValue.MaxValue;
    case BsonType.Vector: return buffer.ReadVector(offset);
The ReadVector method returns float[] but this method returns BsonValue. The return should be wrapped: return new BsonVector(buffer.ReadVector(offset));
Suggested change:
    - case BsonType.Vector: return buffer.ReadVector(offset);
    + case BsonType.Vector: return new BsonVector(buffer.ReadVector(offset));
    {
        page = this.GetPage<VectorIndexPage>(freeVectorPageList);

        ENSURE(page.FreeBytes > bytesLength, "this page shout be space enouth for this new vector node");
Multiple spelling errors in the error message: 'shout' should be 'should' and 'enouth' should be 'enough'
Codex Review: Here are some suggestions.
LiteDB/LiteDB/Engine/Services/SnapShot.cs
Lines 680 to 688 in 99e8720
    foreach(var index in _collectionPage.GetCollectionIndexes())
    {
        // add head/tail (same page) to be deleted
        indexPages.Add(index.Head.PageID);
        foreach (var node in indexer.FindAll(index, Query.Ascending))
        {
            indexPages.Add(node.Page.PageID);
[P1] Skip-list cleanup assumes vector indexes use IndexPage layout
When dropping a collection the engine iterates every entry returned by GetCollectionIndexes() and treats them as skip-list indexes by reading index.Head and calling indexer.FindAll(index, …) to collect IndexPage IDs. Vector indexes stored in the same dictionary have IndexType == 1 and never populate Head/Tail; their metadata lives in VectorIndexMetadata.Root and nodes are backed by VectorIndexPage. Passing a vector index through this block either throws when index.Head.PageID is uint.MaxValue or simply skips releasing vector pages, leaving allocated VectorIndexPage/data blocks behind. The loop needs to exclude vector indexes or invoke a dedicated release routine that walks VectorIndexMetadata.Root instead of the skip-list path.
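The shape of the suggested fix can be sketched as a branch on the index type inside the cleanup loop. Everything below is a hypothetical stand-in: `IndexInfo`, `DropCollectionSketch`, and the pre-collected page lists are invented for illustration and do not mirror LiteDB's real classes; only the `IndexType == 1` convention and the `uint.MaxValue` head value are taken from the review text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a collection index entry; not LiteDB's real type.
public sealed class IndexInfo
{
    public byte IndexType;                     // assumption from the review: 1 = vector index
    public uint HeadPageId = uint.MaxValue;    // vector indexes never populate the head
    public List<uint> SkipListNodePages = new List<uint>(); // stand-in for indexer.FindAll(...)
    public List<uint> VectorPages = new List<uint>();       // stand-in for walking VectorIndexMetadata.Root
}

public static class DropCollectionSketch
{
    // Collects the pages to release, routing vector indexes away from the skip-list path.
    public static (HashSet<uint> IndexPages, HashSet<uint> VectorPages) CollectPages(
        IEnumerable<IndexInfo> indexes)
    {
        var indexPages = new HashSet<uint>();
        var vectorPages = new HashSet<uint>();

        foreach (var index in indexes)
        {
            if (index.IndexType == 1)
            {
                // Dedicated release path: never touches Head/Tail.
                foreach (var pageId in index.VectorPages) vectorPages.Add(pageId);
                continue;
            }

            // Skip-list path: head/tail page plus every node page.
            indexPages.Add(index.HeadPageId);
            foreach (var pageId in index.SkipListNodePages) indexPages.Add(pageId);
        }

        return (indexPages, vectorPages);
    }
}
```

Keeping the two page sets separate makes it harder for a vector page to be freed through the skip-list code path by accident.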
…ector-index-feature' into temp/pr-merge-21-24
…ector-index-feature-sc7x7j' into temp/pr-merge-21-24
…ector-index-feature-3uorv1' into temp/pr-merge-21-24
…ector-index-feature-hhom1x' into temp/pr-merge-21-24
Vector index bug
…ated-interfaces-and-logic Move vector APIs to LiteDB.Vector extension surface
Hello! Will these new features be released in LiteDB 6.0? And can I try them with the 6.0.0-prerelease package on NuGet?
@syan2018 It will be on NuGet in 15-30 minutes 👍
Fixes: #2364 and #2580
This pull request introduces vector similarity search and benchmarking capabilities to the LiteDB project, along with new CLI tooling for document ingestion and search using embeddings. The main changes include adding vector fields to benchmark models, implementing benchmarks for vector similarity queries, and creating extensible CLI commands for ingesting and searching documents with vector embeddings. The changes are grouped below by theme.
Vector Similarity Benchmarking:
- Added QueryWithVectorSimilarity to LiteDB.Benchmarks/Benchmarks/Queries/QueryWithVectorSimilarity.cs, enabling performance testing of vector similarity queries (both indexed and unindexed) using the new Vectors field.
- Updated the FileMetaBase model to include a float[] Vectors property, supporting storage of vector embeddings for similarity search.
- Updated FileMetaGenerator to generate random 128-dimensional vectors for each document, ensuring benchmark data includes vector embeddings.

Vector Search CLI Tooling:
- Added IngestCommand in LiteDB.Demo.Tools.VectorSearch/Commands/IngestCommand.cs, which ingests documents from a directory, splits them into chunks, generates embeddings, and stores them in a database with chunk-level vector indexing. Includes options for skipping unchanged files, pruning missing files, and progress reporting.
- Added SearchCommand in LiteDB.Demo.Tools.VectorSearch/Commands/SearchCommand.cs, allowing users to search for documents via semantic similarity using embeddings, with support for top-K results, max distance filtering, and customizable output.
- Added VectorSearchCommandSettings to handle configuration, authentication, and validation for embedding services and database paths in CLI commands.

Benchmarks
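As a rough picture of what an unindexed similarity query has to do, the sketch below generates random 128-dimensional vectors and runs a brute-force top-K scan with an optional maximum distance. It is not the benchmark code from the PR: `BruteForceTopK`, `RandomVector`, and `TopK` are hypothetical names, and cosine distance is assumed as the metric.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of an unindexed (full-scan) top-K query; not the PR's benchmark.
public static class BruteForceTopK
{
    // Generates a random vector with components in [0, 1).
    public static float[] RandomVector(Random rng, int dimensions = 128)
    {
        var vector = new float[dimensions];
        for (int i = 0; i < dimensions; i++)
            vector[i] = (float)rng.NextDouble();
        return vector;
    }

    // Cosine distance = 1 - cosine similarity; NaN if either vector has zero length.
    private static double CosineDistance(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        return 1.0 - dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Scans every vector, keeps those within maxDistance, and returns the k closest.
    // NaN distances fail the <= comparison and are therefore dropped.
    public static List<(int Id, double Distance)> TopK(
        IReadOnlyList<float[]> vectors, float[] target, int k,
        double maxDistance = double.MaxValue)
    {
        return vectors
            .Select((v, i) => (Id: i, Distance: CosineDistance(v, target)))
            .Where(x => x.Distance <= maxDistance)
            .OrderBy(x => x.Distance)
            .Take(k)
            .ToList();
    }
}
```

The scan costs one distance computation per stored vector, which is the baseline an approximate index such as HNSW is meant to improve on.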
Credits
Thanks @hurley451 for the initial work