feat: Vector Search and Similarity Indexing #2678
Merged
Fixes: #2364 and #2580
Ref: #2646
This pull request introduces a major new capability to LiteDB: Vector Search. This feature enables storing vector embeddings and performing efficient Approximate Nearest Neighbor (ANN) searches, making LiteDB suitable for a wide range of AI and machine learning applications, such as semantic search, recommendation engines, and Retrieval-Augmented Generation (RAG).
Key Features
Vector Storage & Indexing:
- The `BsonVector` type (`float[]`) is introduced for native and efficient storage of vector embeddings.
- Supported distance metrics:
  - `VectorDistanceMetric.Cosine` (default)
  - `VectorDistanceMetric.Euclidean`
  - `VectorDistanceMetric.DotProduct`
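For reference, the three metrics can be sketched in plain C#. This is a minimal illustration and not LiteDB's implementation: the `VectorMetrics` class is hypothetical, and reporting cosine as a distance (1 minus cosine similarity, so smaller means closer) is an assumption. How LiteDB scores `DotProduct` internally is not stated in this description.

```csharp
using System;

// Hypothetical helper, not part of LiteDB: plain implementations of the three metrics.
public static class VectorMetrics
{
    public static double DotProduct(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += (double)a[i] * b[i];
        return sum;
    }

    public static double EuclideanDistance(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = (double)a[i] - b[i];
            sum += d * d;
        }
        return Math.Sqrt(sum);
    }

    // Assumption: cosine is expressed as a distance, 1 - cos(angle),
    // so 0 means "same direction" and 1 means "orthogonal".
    public static double CosineDistance(float[] a, float[] b)
    {
        double dot = DotProduct(a, b);
        double normA = Math.Sqrt(DotProduct(a, a));
        double normB = Math.Sqrt(DotProduct(b, b));

        // Convention chosen for this sketch: a zero vector is maximally far from everything.
        if (normA == 0 || normB == 0) return 1.0;

        return 1.0 - dot / (normA * normB);
    }
}
```

Cosine ignores vector length and compares direction only, which is one reason it is a common default for text embeddings.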
New Fluent Query API:
- Available in the `LiteDB.Vector` namespace:
  - `.WhereNear(x => x.Embedding, targetVector, maxDistance)`: Filters documents where the vector is within a specified distance of the target.
  - `.TopKNear(x => x.Embedding, targetVector, k)`: Efficiently retrieves the top K nearest neighbors to a target vector, using the index to avoid full scans and sorting.
SQL / `BsonExpression` Support:
- A `VECTOR_SIM(field, target_vector)` function and infix operator have been added, allowing vector similarity calculations directly within expressions and SQL queries.
- Example: `SELECT * FROM docs WHERE $.Embedding VECTOR_SIM @0 <= 0.25`
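To make the semantics of the two query methods concrete, here is a brute-force reference in plain C#. It is a sketch under stated assumptions, not LiteDB code: `BruteForceSearch` and its methods are hypothetical, the metric is assumed to be cosine distance, and every item is scanned, which is exactly the work an index-backed `TopKNear` is meant to avoid.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical reference code, not LiteDB's implementation.
public static class BruteForceSearch
{
    // Assumed metric: cosine distance (1 - cosine similarity), smaller is closer.
    public static double CosineDistance(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 1.0; // convention for zero vectors
        return 1.0 - dot / Math.Sqrt(normA * normB);
    }

    // Keeps every item whose vector lies within maxDistance of the target.
    public static List<T> WhereNear<T>(
        IEnumerable<T> items, Func<T, float[]> selector, float[] target, double maxDistance)
    {
        return items
            .Where(x => CosineDistance(selector(x), target) <= maxDistance)
            .ToList();
    }

    // Sorts all items by distance and keeps the k closest (full scan plus sort).
    public static List<T> TopKNear<T>(
        IEnumerable<T> items, Func<T, float[]> selector, float[] target, int k)
    {
        return items
            .OrderBy(x => CosineDistance(selector(x), target))
            .Take(k)
            .ToList();
    }
}
```

A reference like this can also serve as a test oracle. Because the index performs approximate search, its results may occasionally differ from the exact scan.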
End-to-End Demo Application
To showcase this new feature, a new demo project, `LiteDB.Demo.Tools.VectorSearch`, has been added. It's a command-line tool that provides a complete workflow for building a semantic search engine:
- Reads text files (`.txt`, `.md`), splits them into chunks, and generates embeddings using the Google Gemini API.
- Uses `.TopKNear()` to find and display the most semantically similar document chunks from the database.
Usage Example
Getting started with vector search is straightforward.
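A minimal sketch of what a query might look like, using only the method names listed in this description. Several details are assumptions: the `Doc` class and collection name are hypothetical, calling the methods on `Query()` and materializing with `ToList()` is assumed from the "fluent query API" wording, and vector index creation is left out because its API is not shown here.

```csharp
using System;
using LiteDB;
using LiteDB.Vector;

// Hypothetical document type for illustration.
public class Doc
{
    public int Id { get; set; }
    public string Text { get; set; }
    public float[] Embedding { get; set; }
}

public static class Program
{
    public static void Main()
    {
        using var db = new LiteDatabase(":memory:");
        var docs = db.GetCollection<Doc>("docs");

        // NOTE: creating the vector index is omitted; its API is not shown
        // in this description, so check the merged code before relying on it.

        docs.Insert(new Doc { Text = "cats", Embedding = new float[] { 1f, 0f } });
        docs.Insert(new Doc { Text = "dogs", Embedding = new float[] { 0.9f, 0.1f } });
        docs.Insert(new Doc { Text = "cars", Embedding = new float[] { 0f, 1f } });

        var target = new float[] { 1f, 0f };

        // Assumed call shape: the k nearest documents to the target vector.
        var nearest = docs.Query()
            .TopKNear(x => x.Embedding, target, 2)
            .ToList();

        // Assumed call shape: all documents within a distance of 0.25.
        var within = docs.Query()
            .WhereNear(x => x.Embedding, target, 0.25f)
            .ToList();

        foreach (var d in nearest) Console.WriteLine(d.Text);
        Console.WriteLine(within.Count);
    }
}
```

In a real application the embeddings would come from an embedding model, and every vector in a collection would share the same dimension.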
Additional Changes
Benchmarks
Credits
Thanks @hurley451 for the initial work