Implement enhanced vector search with similarity filtering, improved segmentation, and hybrid ranking #164
+1,909
−1
Problem
The existing vector search framework had a significant UX issue where users would frequently get the same content when searching with different keywords. Analysis revealed several root causes:
This resulted in poor search relevance and user frustration.
Solution
Created a new TelegramSearchBot.Vector library that enhances the existing FAISS vector search with four key improvements:
1. Similarity Threshold Filtering
Implemented configurable L2 distance threshold (default: 1.5) to filter out low-quality results:
This immediately removes irrelevant matches before they reach the user.
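As a rough illustration of the threshold idea (not the library's actual C# code), the sketch below drops any hit whose L2 distance exceeds the 1.5 default. The `Hit` type and the `filter_by_distance` name are hypothetical, and treating the threshold as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical search hit; a smaller L2 distance means a closer match."""
    message_id: int
    l2_distance: float


def filter_by_distance(hits, max_distance=1.5):
    """Keep hits at or below the distance threshold, closest first."""
    kept = [h for h in hits if h.l2_distance <= max_distance]
    return sorted(kept, key=lambda h: h.l2_distance)


hits = [Hit(1, 0.42), Hit(2, 1.73), Hit(3, 1.10)]
filtered = filter_by_distance(hits)
```

Because L2 is a distance rather than a similarity, the comparison keeps small values; a cosine-similarity index would need the opposite test.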
2. Multi-Dimensional Conversation Segmentation
Replaced the simple count-based segmentation with intelligent topic detection that considers:
This creates cleaner topic boundaries and prevents unrelated content from being grouped together.
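One way such boundary detection could look is sketched below. It is not the PR's implementation: the two signals used (time gap and vocabulary overlap), the 30-minute gap, and the 0.1 overlap cutoff are all assumptions chosen for illustration.

```python
import re


def tokens(text):
    """Lowercased word set; a crude stand-in for real topic features."""
    return set(re.findall(r"\w+", text.lower()))


def segment(messages, max_gap_seconds=1800, min_overlap=0.1):
    """Split time-sorted (timestamp_seconds, text) pairs into segments.

    A new segment starts when the pause since the previous message is too
    long, or when the message shares too little vocabulary (Jaccard overlap)
    with the current segment.
    """
    segments, current, vocab, last_ts = [], [], set(), None
    for ts, text in messages:
        words = tokens(text)
        if current:
            union = vocab | words
            overlap = len(vocab & words) / len(union) if union else 1.0
            if ts - last_ts > max_gap_seconds or overlap < min_overlap:
                segments.append(current)
                current, vocab = [], set()
        current.append((ts, text))
        vocab |= words
        last_ts = ts
    if current:
        segments.append(current)
    return segments


messages = [
    (0, "faiss index rebuild failed"),
    (60, "did the faiss index rebuild finish"),
    (120, "lunch anyone"),
    (10000, "lunch was great"),
]
segments = segment(messages)
```

Here the third message starts a new segment because it shares no words with the first two, and the fourth starts another because of the long pause.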
3. Hybrid Ranking System
Combines vector similarity with keyword matching for better relevance:
Results now favor content that is both semantically similar AND contains the actual search terms.
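A minimal sketch of hybrid scoring follows, under stated assumptions: the distance-to-similarity mapping `1 / (1 + d)`, the 0.7/0.3 weighting, and the substring-based keyword score are illustrative choices, not values taken from the PR.

```python
def vector_similarity(l2_distance):
    """Map an L2 distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + l2_distance)


def keyword_score(query, content):
    """Fraction of whitespace-separated query terms found in the content."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    text = content.lower()
    return sum(1 for t in terms if t in text) / len(terms)


def hybrid_score(query, content, l2_distance, vector_weight=0.7):
    """Weighted blend of vector similarity and keyword match."""
    return (vector_weight * vector_similarity(l2_distance)
            + (1.0 - vector_weight) * keyword_score(query, content))


def rank(query, candidates):
    """Sort (content, l2_distance) pairs by hybrid score, best first."""
    return sorted(candidates,
                  key=lambda c: hybrid_score(query, c[0], c[1]),
                  reverse=True)


candidates = [
    ("weekly sync notes", 0.5),
    ("how to rebuild the faiss index", 0.6),
]
ranked = rank("faiss index", candidates)
```

In this example the second candidate is slightly farther away in vector space but contains both query terms, so it ranks first.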
4. Content-Based Deduplication
Uses SHA-256 content hashing to identify and remove duplicates, keeping only the highest-relevance result per unique content.
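The deduplication step can be sketched as below. Hashing with SHA-256 and keeping the best-scoring result per hash come from the description above; the whitespace and case normalization before hashing is an assumption, as are the function names.

```python
import hashlib


def content_hash(text):
    """SHA-256 of the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(results):
    """Keep the highest-scoring (content, score) pair per content hash."""
    best = {}
    for content, score in results:
        key = content_hash(content)
        if key not in best or score > best[key][1]:
            best[key] = (content, score)
    return sorted(best.values(), key=lambda r: r[1], reverse=True)


results = [("Hello world", 0.61), ("hello   world", 0.80), ("Other text", 0.70)]
unique = deduplicate(results)
```

Exact hashing only catches identical content (after normalization); near-duplicates with different wording would still need a similarity-based check.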
Implementation Details
Architecture
The solution is implemented as a standalone library (TelegramSearchBot.Vector) that wraps and enhances the existing FaissVectorService:
Backward Compatibility
The enhanced search is opt-in via configuration:
When disabled, the original vector search behavior is completely unchanged. No breaking changes to existing functionality.
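The wrap-and-opt-in pattern described here can be illustrated with a small sketch. The class names, the `enabled` flag, and the `(content, l2_distance)` result shape are hypothetical stand-ins; the real configuration key and service interfaces are not shown in this description.

```python
class BaseVectorSearch:
    """Stand-in for the existing search service."""

    def __init__(self, hits):
        self._hits = hits  # list of (content, l2_distance)

    def search(self, query):
        return list(self._hits)


class EnhancedVectorSearch:
    """Wraps the base service; only changes results when enabled."""

    def __init__(self, base, enabled=False, max_distance=1.5):
        self._base = base
        self._enabled = enabled
        self._max_distance = max_distance

    def search(self, query):
        hits = self._base.search(query)
        if not self._enabled:
            return hits  # original behavior, untouched
        return [h for h in hits if h[1] <= self._max_distance]


base = BaseVectorSearch([("close match", 0.4), ("far match", 2.1)])
legacy = EnhancedVectorSearch(base, enabled=False).search("q")
enhanced = EnhancedVectorSearch(base, enabled=True).search("q")
```

Keeping the enhancement in a wrapper means the disabled path returns exactly what the base service produced.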
Testing
Comprehensive test suite with 100% pass rate:
Usage
Once enabled in configuration, enhanced search is automatically used for vector searches:
Results now include detailed scoring information:
Advanced Usage
The service can also be used directly for re-segmentation or statistics:
Results
This implementation provides:
✅ More relevant results through similarity threshold filtering
✅ No duplicates via content-based deduplication
✅ Better ranking with hybrid vector + keyword scoring
✅ Cleaner topic segments using multi-dimensional detection
✅ Backward compatible with existing search functionality
✅ Well tested with comprehensive unit test coverage
✅ Fully documented with usage guides and implementation details
Documentation
Added comprehensive documentation:
TelegramSearchBot.Vector/README.md - Usage guide with examples
Docs/Vector_Search_Improvements.md - Implementation details and architecture
Performance Impact
Fixes the core issue where different search keywords would return the same irrelevant content, improving search quality and user satisfaction.
Original prompt
implement this.
The following is a snapshot of a chat conversation between a user and an AI coding assistant. Prioritize later messages in the conversation.
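User: You don't need to modify any code. Help me think through whether the existing vector search framework is reasonable and what potential problems it has. By vector search framework I mean the full path from user input to the program displaying results.
AI: Based on my analysis of the code, here is an assessment of the soundness and potential problems of the existing vector search framework. The overall vector search flow is roughly as follows:
Existing vector search flow
1. User input handling
"向量搜索 查询内容" ("vector search <query content>")
AI: detects the prefix and creates
AI: object
2. Search execution
AI: chooses a strategy based on the chat type:
AI: executes the actual search
3. Vector generation and search
AI: ) converts the query text into a 1024-dimensional vector
4. Result processing and display
AI: renders the results, showing the similarity score, search type, and so on
Analysis of potential problems
✅ Architectural strengths
1. Performance issues
2. Concurrency and consistency
AI: lock granularity is coarse, which may hurt concurrent writes
AI: collection manages writing indexes to disk, but there is no durability guarantee
MaxParallelVectorization = 4
is hard-coded and may not suit every environment
3. Data consistency
4. Search quality issues
5. Scalability limits
6. Error handling and monitoring
7. Resource management
AI: cache has no LRU policy or size limit
Suggested improvements
Overall, the framework is sound in terms of functional completeness and technology choices, but there is room for improvement in performance optimization, concurrency handling, and scalability. I suggest prioritizing private-chat search efficiency and index consistency.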
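User: I feel the user interaction is the problem with this structure: different search keywords often return the same content. I want to know what causes that.
AI: Based on a deeper analysis of the code, I found several key problems that cause "different search keywords returning the same content":
Main problems
1. Too many search results, with no quality filtering
2. No similarity threshold filtering
3. Conversation segments are too broad
4. Vector content includes too much metadata
5. Group-chat search has no deduplication at all
6. Private-chat search deduplication is incomplete
Root cause
Core problem: the vector search notion of "similarity" is too broad and does not distinguish "content relevance" from "structural similarity".
Proposed solutions
These problems make users feel that search is inaccurate: they use different keywords yet keep getting the same content.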
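User: So what would you suggest for implementing this?
AI: Based on the problems identified earlier in the vector search framework (especially different keywords returning the same content), I suggest the following implementation plan. It targets the core issues: similarity threshold filtering, better conversation segmentation, improved vector content, result deduplication, and re-ranking. Suggestions are ordered from highest to lowest priority.
1. Add similarity threshold filtering (highest priority)
AI: method, add a threshold parameter.
2. Optimize the conversation segmentation strategy
AI: segmentation logic, splitting more finely based on topic changes or time gaps.
AI: method, add topic detection.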