perf: batch metadata query in KB retrieval to fix N+1 problem (#5463)
Soulter merged 2 commits into AstrBotDevs:master
Conversation
Replace N sequential get_document_with_metadata() calls with a single get_documents_with_metadata_batch() call using a SQL IN clause.

Benchmark results (local SQLite):
- 10 docs: 10.67 ms → 1.47 ms (7.3x faster)
- 20 docs: 26.00 ms → 2.68 ms (9.7x faster)
- 50 docs: 63.87 ms → 2.79 ms (22.9x faster)
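The before/after difference can be sketched with the stdlib `sqlite3` module. This is an illustration of the access pattern only, not the project's actual code: the real implementation goes through the async ORM layer in kb_db_sqlite.py, and the `docs` table and its columns here are hypothetical placeholders.

```python
# Illustration of replacing N sequential lookups with one batched IN query.
# The "docs" table and columns are hypothetical stand-ins for the real schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [(f"d{i}", f"Doc {i}") for i in range(10)],
)

doc_ids = [f"d{i}" for i in range(10)]

# Before: one query per document (N round trips -- the N+1 pattern).
rows_n_plus_1 = [
    conn.execute("SELECT doc_id, title FROM docs WHERE doc_id = ?", (d,)).fetchone()
    for d in doc_ids
]

# After: a single query with an IN clause (one round trip).
placeholders = ",".join("?" * len(doc_ids))
rows_batched = conn.execute(
    f"SELECT doc_id, title FROM docs WHERE doc_id IN ({placeholders})", doc_ids
).fetchall()

# Both strategies return the same rows; only the number of round trips differs.
assert sorted(rows_n_plus_1) == sorted(rows_batched)
```

The speedup in the benchmark above comes from collapsing those N round trips (each with its own statement preparation and execution overhead) into one.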
Summary of Changes (Gemini Code Assist)

This pull request significantly improves system performance by optimizing how knowledge base document metadata is queried. By introducing a batch query method and integrating it into the document retrieval flow, it resolves the N+1 query problem, substantially reducing the number of database round trips and the response time.

Highlights
Hey - I've found 1 issue, and left some high-level feedback:
- The new `get_documents_with_metadata_batch` API could accept a more general `Iterable[str]` (or `Collection[str]`) instead of `list[str]`, to avoid forcing callers to materialize a list when they already have a set or other iterable of IDs.
- If `doc_ids` can grow large in some usage patterns, consider chunking the `IN` query into batches to avoid constructing a very large `IN` clause in a single statement, which can hit SQLite parameter limits or impact performance.

Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `get_documents_with_metadata_batch` API could accept a more general `Iterable[str]` (or `Collection[str]`) instead of `list[str]` to avoid forcing callers to materialize a list when they already have a set or other iterable of IDs.
- If `doc_ids` can grow large in some usage patterns, consider chunking the `IN` query into batches to avoid constructing a very large `IN` clause in a single statement, which can hit SQLite parameter limits or impact performance.
## Individual Comments
### Comment 1
<location path="astrbot/core/knowledge_base/kb_db_sqlite.py" line_range="281" />
<code_context>
+ KnowledgeBase,
+ col(KBDocument.kb_id) == col(KnowledgeBase.kb_id),
+ )
+ .where(col(KBDocument.doc_id).in_(doc_ids))
+ )
+ result = await session.execute(stmt)
</code_context>
<issue_to_address>
**issue (performance):** Large `doc_ids` inputs may lead to very large `IN` clauses and performance issues.
If `doc_ids` can be large, this `IN` clause may both degrade query performance and exceed SQLite’s parameter limits. Consider either batching `doc_ids` into smaller chunks and unioning the results, or enforcing a maximum size for `doc_ids` before building the query.
</issue_to_address>
Address review feedback:
- Change the doc_ids param from list[str] to set[str] to avoid unnecessary conversion
- Chunk the IN clause into batches of 900 to stay under SQLite's 999-parameter limit
- Remove list() wrapping at the call site; pass the set directly
Modifications

Screenshots or Test Results
Benchmark results (local SQLite):
- 10 docs: 10.67 ms → 1.47 ms (7.3x faster)
- 20 docs: 26.00 ms → 2.68 ms (9.7x faster)
- 50 docs: 63.87 ms → 2.79 ms (22.9x faster)
Checklist
- I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.

Summary by Sourcery
Optimize knowledge base retrieval by batching document metadata queries to eliminate an N+1 access pattern.
Enhancements: