feat: Improve search performance#195
feat: Improve search performance#195deepin-bot[bot] merged 2 commits intolinuxdeepin:develop/snipefrom
Conversation
This will improve search result filtering performance.
Reviewer's GuideIndexes ancestor directory paths for each file and updates the searcher to enforce both filename and path-prefix constraints via a single Lucene boolean query, removing client-side path filtering and bumping the index schema version. Sequence diagram for the updated search with ancestor path constraintsequenceDiagram
actor User
participant Searcher
participant BooleanQuery
participant IndexSearcher
participant LuceneIndex
User->>Searcher: search(path, query, wildcard_query, max_results)
activate Searcher
Searcher->>BooleanQuery: create finalQuery
alt wildcard_query is true
Searcher->>BooleanQuery: add WildcardQuery(file_name_lower) MUST
else
Searcher->>BooleanQuery: add ParserQuery(file_name) MUST
end
Searcher->>Searcher: canonical = canonicalize_filename(path)
Searcher->>BooleanQuery: add TermQuery(ancestor_paths = canonical) MUST
Searcher->>IndexSearcher: search(finalQuery, max_results)
activate IndexSearcher
IndexSearcher->>LuceneIndex: execute boolean query
LuceneIndex-->>IndexSearcher: TopDocs
deactivate IndexSearcher
IndexSearcher-->>Searcher: TopDocs
loop for each hit in TopDocs
Searcher->>LuceneIndex: doc(docId)
LuceneIndex-->>Searcher: Document(full_path, file_type, ...)
Searcher->>Searcher: serialize fields into result string
end
Searcher-->>User: vector<string> results
deactivate Searcher
Class diagram for updated Searcher and indexing helpersclassDiagram
class file_record {
+string full_path
+string file_name
+string file_type
+string file_ext
+string modify_time_str
+string file_size_str
+string pinyin
+string pinyin_acronym
+bool is_hidden
}
class Document {
+add(field : FieldPtr) : void
}
class Field {
+name : wstring
+value : wstring
+STORE_NO : int
+STORE_YES : int
+INDEX_NOT_ANALYZED : int
}
class IndexerHelpers {
+add_ancestor_paths(doc : DocumentPtr, full_path : string) : void
+create_document(record : file_record) : DocumentPtr
}
class Searcher {
+reader : IndexReaderPtr
+searcher : IndexSearcherPtr
+search(path : string, query : string, wildcard_query : bool, max_results : int) : vector~string~
}
class BooleanQuery {
+add(query : QueryPtr, occur : BooleanClauseOccur) : void
}
class Queries {
+WildcardQuery(term : TermPtr)
+TermQuery(term : TermPtr)
+parse_with_analyzer(field : wstring, query_string : wstring) : QueryPtr
}
file_record --> IndexerHelpers : used by
IndexerHelpers --> Document : populates
IndexerHelpers --> Field : creates
Searcher ..> BooleanQuery : builds finalQuery
Searcher ..> Queries : creates filename and
path queries
Searcher ..> Document : reads indexed fields
Document --> Field : contains multiple
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey there - I've reviewed your changes - here's some feedback:
- The path indexing in
add_ancestor_pathsassumes/as the separator; if this code ever runs on Windows-style paths (e.g.,C:\dir\file), consider normalizing paths or usingstd::filesystemto derive parent directories to avoid mismatches with the search-side canonicalization. - The new path constraint uses an exact
TermQueryonancestor_paths; this changes semantics when thepathargument points to a file rather than a directory (or is empty), so it’s worth confirming or enforcing thatpathis always a directory and handling or rejecting other inputs explicitly. - Since the filtering has moved entirely into the Lucene query, the debug log still printing
reader->numDocs()can be misleading when diagnosing path-filtered results; consider logging the final query or the number of hits instead/as well to aid troubleshooting of the new behavior.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The path indexing in `add_ancestor_paths` assumes `/` as the separator; if this code ever runs on Windows-style paths (e.g., `C:\dir\file`), consider normalizing paths or using `std::filesystem` to derive parent directories to avoid mismatches with the search-side canonicalization.
- The new path constraint uses an exact `TermQuery` on `ancestor_paths`; this changes semantics when the `path` argument points to a file rather than a directory (or is empty), so it’s worth confirming or enforcing that `path` is always a directory and handling or rejecting other inputs explicitly.
- Since the filtering has moved entirely into the Lucene query, the debug log still printing `reader->numDocs()` can be misleading when diagnosing path-filtered results; consider logging the final query or the number of hits instead/as well to aid troubleshooting of the new behavior.
## Individual Comments
### Comment 1
<location> `src/daemon/src/core/file_index_manager.cpp:124` </location>
<code_context>
+ * 3. "/home"
+ * 4. "/"
+ */
+static void add_ancestor_paths(DocumentPtr doc, const std::string &full_path)
+{
+ if (full_path.empty()) return;
</code_context>
<issue_to_address>
**issue (complexity):** Consider rewriting the ancestor-path building loop using std::filesystem::path to simplify path manipulation and clarify loop termination.
You can reduce the custom string/path handling and make the termination conditions clearer by delegating to `std::filesystem::path` and using an explicit loop condition instead of `while (true)` + multiple `break`s.
For example:
```cpp
#include <filesystem>
// namespace fs = std::filesystem; // if you use a namespace alias
static void add_ancestor_paths(DocumentPtr doc, const std::string& full_path)
{
if (full_path.empty()) return;
std::filesystem::path p(full_path);
// Start from the parent directory of the file
for (p = p.parent_path(); !p.empty(); p = p.parent_path()) {
auto s = p.generic_string(); // normalized '/' separators
// Map the root path consistently to "/"
if (p == p.root_path()) {
s = "/";
}
doc->add(newLucene<Field>(
ANCESTOR_PATHS_FIELD,
StringUtils::toUnicode(s),
Field::STORE_NO,
Field::INDEX_NOT_ANALYZED));
// Once we've added the root, stop
if (p == p.root_path()) {
break;
}
}
}
```
This keeps all existing behavior:
- Skips pure filenames without a directory component.
- Adds all ancestor directories from the immediate parent up to and including `/`.
- Avoids manual `find_last_of('/')`, `substr`, and explicit root special cases in multiple branches.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
deepin pr auto review我来对这段代码进行详细审查:
在
具体改进代码示例: // 建议将宏定义改为常量
const std::wstring ANCESTOR_PATHS_FIELD = L"ancestor_paths";
// 改进后的 add_ancestor_paths 函数
static void add_ancestor_paths(DocumentPtr doc, const std::string &full_path) {
if (full_path.empty() || full_path.length() > PATH_MAX) return;
// 规范化路径
std::string normalized_path = std::filesystem::canonical(full_path).string();
// 防止路径遍历攻击
if (normalized_path.find("..") != std::string::npos) return;
std::string current_path = normalized_path;
// 获取父目录
size_t last_slash = current_path.find_last_of('/');
if (last_slash == std::string::npos) return;
// 处理根目录情况
if (last_slash == 0) {
current_path = "/";
} else {
current_path = current_path.substr(0, last_slash);
}
// 使用 vector 预分配空间
std::vector<std::string> ancestors;
ancestors.reserve(16); // 预估合理的路径深度
while (true) {
ancestors.push_back(current_path);
if (current_path == "/" || current_path.empty()) {
break;
}
last_slash = current_path.find_last_of('/');
if (last_slash == std::string::npos) break;
current_path = (last_slash == 0) ? "/" : current_path.substr(0, last_slash);
}
// 批量添加到文档
for (const auto& path : ancestors) {
doc->add(newLucene<Field>(ANCESTOR_PATHS_FIELD,
StringUtils::toUnicode(path),
Field::STORE_NO,
Field::INDEX_NOT_ANALYZED));
}
}
// 改进后的搜索结果构建函数
static std::string build_search_result(const DocumentPtr& doc) {
std::stringstream ss;
ss << StringUtils::toUTF8(doc->get(L"full_path"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"file_type"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"file_ext"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"modify_time_str"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"file_size_str"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"pinyin"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"pinyin_acronym"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"is_hidden"));
return ss.str();
}这些改进主要关注了:
建议在实际应用中根据具体需求和环境进一步调整这些改进。 |
This will improve search result filtering performance.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: lzwind, wangrong1069 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/forcemerge |
|
This pr force merged! (status: unstable) |
26381f8
into
linuxdeepin:develop/snipe
Summary by Sourcery
Index ancestor directory paths for each file and use them for path-prefixed filtering during search, updating the index version accordingly.
New Features:
Enhancements: