Feat ocr #274

Merged
deepin-bot[bot] merged 20 commits into linuxdeepin:master from Johnson-zs:feat_ocr
Apr 20, 2026

@Johnson-zs
Contributor

No description provided.

as title

Log:
as title

Log:
fix: change content index directory path

Changed the content index directory from GenericConfigLocation to
GenericDataLocation and renamed the folder from "index" to
"fulltext-index". Search index files are now stored in the data
directory rather than the configuration directory, which is better
suited to large index files.

Log: Moved search index storage location to data directory

Influence:
1. Verify that search index files are now created in the correct data
directory
2. Test that full-text search functionality still works correctly after
the directory change
3. Check that old index files are properly migrated or new ones are
created
4. Ensure search performance is not affected by the directory change
5. Verify that the application handles the directory transition smoothly

refactor: centralize Lucene field names as constants

Replace hardcoded Lucene field names with centralized constants from
the LuceneFieldNames header to improve maintainability and reduce
errors. Added a new header file, lucene_field_names.h, containing all
field name constants organized by index type (FileName and Content).
This keeps field names consistent across the codebase and makes them
easier to change.

Influence:
1. Verify content search functionality still works correctly
2. Test filename search with various query types
3. Check hidden file filtering in both content and filename searches
4. Validate path prefix query optimization
5. Test pinyin and acronym search functionality
6. Verify detailed search results display correctly

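As a rough illustration of the centralized-constants idea, the header could look something like the sketch below. The namespace names `LuceneFieldNames::FileName` and `LuceneFieldNames::Content`, and the constants `kAncestorPaths`, `kPath`, and `kContents`, appear elsewhere in this conversation; `kFileName`, every string value, and the `describeField` helper are assumptions made up for the example and are not taken from lucene_field_names.h.

```cpp
#include <cassert>
#include <string>

// Sketch only: the string values are illustrative assumptions, not the
// real field names stored in the index.
namespace LuceneFieldNames {
namespace FileName {
inline constexpr wchar_t kFileName[] = L"file_name";          // hypothetical
inline constexpr wchar_t kAncestorPaths[] = L"ancestor_paths"; // value assumed
}   // namespace FileName
namespace Content {
inline constexpr wchar_t kPath[] = L"path";          // value assumed
inline constexpr wchar_t kContents[] = L"contents";  // value assumed
}   // namespace Content
}   // namespace LuceneFieldNames

// Hypothetical caller: it refers to the constant instead of repeating a
// string literal, so renaming a field touches a single line.
std::wstring describeField(const wchar_t *field)
{
    return std::wstring(L"field:") + field;
}
```

Grouping the constants by index type means a typo in a field name becomes a compile error rather than a silently empty search result.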
feat: add OCR text search support

1. Implement OCR text search engine with Lucene index support
2. Add OCR text search API classes for options and results handling
3. Extend search type enum to include OCR search type
4. Add OCR-specific error codes and error handling
5. Implement OCR text indexed search strategy with advanced query logic
6. Support mixed AND search across OCR contents and filename fields
7. Add OCR text index directory management and version checking
8. Update search client to support OCR search type
9. Move field names header to public include directory

Log: Added OCR text search capability for searching text extracted from
images

Influence:
1. Test OCR text search with various query types (simple, boolean)
2. Verify mixed AND search behavior across OCR contents and filename
fields
3. Test OCR search error handling with short keywords and unsupported
wildcards
4. Validate OCR index directory management and version checking
5. Test integration with existing filename and content search
functionality
6. Verify search client supports all three search types (filename,
content, ocr)

fix: improve content index path matching logic

The existing path matching logic used simple string prefix matching,
which cannot accurately determine whether a path belongs to an indexed
directory. It could produce false positives when paths share a common
prefix but are not actually subdirectories.

Changes made:
1. Added new helper function isPathInAnyDirectory with proper path
normalization
2. Implemented exact path matching for when the path is the indexed
directory itself
3. Added proper path separator handling to ensure accurate subdirectory
detection
4. Replaced duplicate logic in three different index checking functions
with the new utility function

The fix ensures that path matching considers both exact directory
matches and proper subdirectory relationships with correct path
separation.

Influence:
1. Test path matching with exact directory paths
2. Verify subdirectory detection with various path depths
3. Test paths that share common prefixes but are not subdirectories
4. Validate path normalization handles trailing slashes correctly
5. Confirm blacklist functionality still works properly

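The matching rule described in this commit (exact match, or prefix followed by a path separator) can be sketched with the standard library alone. The function name `isPathInAnyDirectory` comes from the commit message, but its real signature and normalization steps are not shown in this conversation, so the version below is an assumption built on plain `std::string` paths with `/` as the separator; `normalizeDir` is a hypothetical helper.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: strip trailing slashes so "/home/user/" and
// "/home/user" compare equal. The root "/" is left untouched.
static std::string normalizeDir(std::string path)
{
    while (path.size() > 1 && path.back() == '/')
        path.pop_back();
    return path;
}

// Sketch of the check: true if `path` is one of `dirs` or lies below one.
bool isPathInAnyDirectory(const std::string &path,
                          const std::vector<std::string> &dirs)
{
    const std::string p = normalizeDir(path);
    for (const std::string &raw : dirs) {
        const std::string dir = normalizeDir(raw);
        if (dir.empty())
            continue;
        if (p == dir)
            return true;   // the path is the indexed directory itself
        // Require a separator after the directory name, so that
        // "/home/user/docs-old" does not match "/home/user/docs".
        const std::string prefix = (dir == "/") ? dir : dir + '/';
        if (p.compare(0, prefix.size(), prefix) == 0)
            return true;
    }
    return false;
}
```

A plain `startsWith` check would accept `/home/user/docs-old/a.txt` for the indexed directory `/home/user/docs`; appending the separator before comparing is what rules that case out.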
feat: add file creation time field constant

1. Added kBirthTimeTime constant to Lucene field names
2. Included the new field in three namespaces: default, Content and
OcrText
3. Enables tracking and searching by file creation/birth time in search
functionality

Influence:
1. Test search functionality using birth time field
2. Verify file creation time indexing works correctly
3. Check compatibility with existing search queries

refactor: fix field names and add modify_time

1. Fixed typo in birth_time field name (changed from kBirthTimeTime
to kBirthTime)
2. Added new kModifyTime field (numeric timestamp) alongside existing
kModifyTimeStr
3. Ensured consistency in field names across different namespaces
(Content and OcrText)
4. Maintained backward compatibility while improving naming clarity

Log: Modified search index field names for better consistency

Influence:
1. Verify existing search queries still work with the corrected field
names
2. Test that new modify_time field is properly indexed and searchable
3. Check that time-based searches work correctly with both birth_time
and modify_time
4. Validate backward compatibility with existing indexed data

feat: implement time range filtering for search

1. Added TimeRangeFilter class with fluent interface for time-based
queries
2. Added support for time filtering in indexed and real-time search
strategies
3. Implemented TimeField (birth/modify time) and TimeUnit enumerations
4. Added test cases covering all time range combinations and boundary
conditions
5. Added time filtering to content search, file name search and OCR
text search

Log: Added time range filtering support for file searching

Influence:
1. Test file search with different time ranges (today, last week, custom
range)
2. Verify both creation time and modification time filtering
3. Test boundary conditions for inclusive/exclusive ranges
4. Test combined keyword and time range searches
5. Verify real-time search updates with time filter changes

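To make "fluent interface" and "inclusive/exclusive ranges" concrete, here is a minimal sketch. It is not the TimeRangeFilter API from this PR, whose method names are not shown in the conversation: `TimeRangeSketch`, `field`, `from`, `to`, and `matches` are all hypothetical, and timestamps are assumed to be seconds since the epoch. Only the `TimeField` idea (birth versus modify time) comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

enum class TimeField { BirthTime, ModifyTime };

// Hypothetical fluent filter: every setter returns *this so that calls
// can be chained in a single expression.
class TimeRangeSketch
{
public:
    TimeRangeSketch &field(TimeField f) { m_field = f; return *this; }
    TimeRangeSketch &from(std::int64_t secs, bool inclusive = true)
    {
        m_start = secs;
        m_startInclusive = inclusive;
        return *this;
    }
    TimeRangeSketch &to(std::int64_t secs, bool inclusive = true)
    {
        m_end = secs;
        m_endInclusive = inclusive;
        return *this;
    }

    // Picks the timestamp selected by field() and tests it against the
    // configured bounds, honouring the inclusive/exclusive flags.
    bool matches(std::int64_t birthSecs, std::int64_t modifySecs) const
    {
        const std::int64_t t =
            (m_field == TimeField::BirthTime) ? birthSecs : modifySecs;
        const bool afterStart = m_startInclusive ? t >= m_start : t > m_start;
        const bool beforeEnd = m_endInclusive ? t <= m_end : t < m_end;
        return afterStart && beforeEnd;
    }

private:
    TimeField m_field = TimeField::ModifyTime;
    std::int64_t m_start = std::numeric_limits<std::int64_t>::min();
    std::int64_t m_end = std::numeric_limits<std::int64_t>::max();
    bool m_startInclusive = true;
    bool m_endInclusive = true;
};
```

With this shape, a half-open range reads as `filter.field(TimeField::ModifyTime).from(100).to(200, false)`, which accepts 100 and 199 but rejects 200.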
feat: add CLI options and output formatters to search client

Refactored the search client to implement a comprehensive CLI option
system:
1. Added CliOptions class for parsing command line arguments
2. Implemented TextOutput and JsonOutput formatters for different output
formats
3. Added time range filtering capabilities with TimeParser utility
4. Reorganized main.cpp into cleaner modular structure
5. Added CMake entries for new source files

The changes enable more flexible command-line usage with configurable
search parameters, time filters and output formats while maintaining
backward compatibility.

Log: Added advanced CLI options and JSON output format to search client

Influence:
1. Test all CLI options combinations
2. Verify JSON and text output formats
3. Test time range filters with various formats
4. Check error handling for invalid inputs
5. Verify file type and extension filters

feat: enhance search result metadata and output

1. Added TimeResultAPI class to centralize time-related operations
2. Expanded metadata in ContentResultAPI, FileNameResultAPI, and
OcrTextResultAPI with:
   - Extended file attributes (name, extension, hidden status)
   - Detailed time information (creation/modification timestamps and
formatted strings)
3. Implemented verbose output mode in CLI with -v flag showing full
metadata
4. Improved JSON output to include complete file metadata when verbose
5. Deprecated legacy time-related methods in favor of timestamp-based
API
6. Added new field processing in all search strategies to populate
extended metadata

Log: Enhanced search results with detailed metadata and improved output
options

Influence:
1. Test all search types (filename, content, OCR) with verbose mode
2. Verify JSON output contains all metadata fields
3. Check time-related filtering with new timestamp fields
4. Test backward compatibility with legacy modifiedTime() method
5. Verify output formatting in both simple and verbose modes
6. Test hidden file detection and display

refactor: optimize file search strategy logic

Improved the file search strategy by removing the dedicated FileType and
FileExt search types in favor of unified Combined search handling. All
file type/extension searches are now processed through the Combined
search path for better consistency and maintainability.

Key changes:
1. Removed FileType and FileExt search types from enum and related
processing code
2. Modified determineSearchType to always use Combined when file
types/extensions are present without keywords
3. Simplified buildIndexQuery by removing dedicated file type/extension
cases
4. Improved combined search logic with better conditional processing
5. Updated comments to reflect the unified search approach

The change was made to reduce code complexity and provide more
consistent search behavior regardless of whether a keyword is present.
The Combined search path already handles all necessary functionality.

Influence:
1. Verify all file searches still work correctly with and without
keywords
2. Test combinations of keywords with file types/extensions
3. Check basic file type searches without keywords
4. Validate boolean search functionality
5. Test pinyin and pinyin acronym searches

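The dispatch rule from this commit can be sketched as follows. The real enum values and the signature of determineSearchType are not visible in this conversation, so `SearchKind`, `QuerySketch`, and `determineSearchKind` are hypothetical stand-ins. The sketch also assumes that a keyword combined with type or extension filters takes the Combined path, which the commit message implies but does not spell out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical search kinds; dedicated FileType/FileExt kinds are
// deliberately absent, mirroring the refactor described above.
enum class SearchKind { Simple, Boolean, Combined };

// Hypothetical query description used only for this sketch.
struct QuerySketch
{
    std::string keyword;
    bool isBoolean;
    std::vector<std::string> fileTypes;
    std::vector<std::string> fileExts;
};

SearchKind determineSearchKind(const QuerySketch &q)
{
    const bool hasFilters = !q.fileTypes.empty() || !q.fileExts.empty();
    // Any type/extension filter goes through the Combined path,
    // whether or not a keyword is present.
    if (hasFilters)
        return SearchKind::Combined;
    return q.isBoolean ? SearchKind::Boolean : SearchKind::Simple;
}
```

Routing the keyword-less case through the same branch means there is only one place where type and extension clauses are built.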
chore: update search binary name and install path

The change updates the DFM search client binary name from
'dfm6-search-client' to 'dfm-searcher' for better clarity and
consistency. Additionally, the installation path has been moved from
libexec to the bin directory so the binary is reachable on the system
path.

This modification includes:
1. Changed project name in CMakeLists.txt from the version-specific
'dfm6-search-client' to the consistent 'dfm-searcher'
2. Updated binary installation path from libexec to the standard bin
directory
3. Modified package install configuration accordingly

The change improves usability by making the search tool more
discoverable and aligns with standard binary naming conventions.

refactor: optimize search utility path checking functions

1. Refactored isPathInContentIndexDirectory to use static cached
directory list
2. Refactored isPathInOcrTextIndexDirectory similarly to reduce repeated
calls
3. Optimized isPathInFileNameIndexDirectory with cached lists for both
blacklist and indexed directories
4. Changes improve performance by avoiding repeated calls to directory
fetching functions

The modifications cache directory lists as static variables to prevent
repeated calls to defaultIndexedDirectory() and defaultBlacklistPaths()
within frequently called path checking methods. This reduces
computational overhead while maintaining the same functionality.

Influence:
1. Verify path checking still works correctly in all cases
2. Test performance impact by comparing search operations before and
after
3. Ensure blacklist functionality remains effective
4. Check edge cases with different path combinations

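The caching technique named in this commit is a function-local static. A minimal sketch is shown below; `fetchIndexedDirectories` is a hypothetical stand-in for an expensive lookup such as defaultIndexedDirectory(), the returned paths are made up, and the call counter exists only so the effect can be observed.

```cpp
#include <cassert>
#include <string>
#include <vector>

static int g_fetchCount = 0;   // instrumentation for this sketch only

// Hypothetical stand-in for an expensive directory lookup.
static std::vector<std::string> fetchIndexedDirectories()
{
    ++g_fetchCount;
    return {std::string("/home/user"), std::string("/data")};
}

// The list is computed on the first call and reused afterwards.
const std::vector<std::string> &cachedIndexedDirectories()
{
    // Function-local static: initialized exactly once, and the
    // initialization is thread-safe since C++11.
    static const std::vector<std::string> dirs = fetchIndexedDirectories();
    return dirs;
}
```

One trade-off worth keeping in mind: a value cached this way lives for the whole process, so if the underlying directory configuration can change at runtime, the cached list will not reflect it.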
@github-actions

TAG Bot

TAG: 1.3.52
EXISTED: no
DISTRIBUTION: unstable

@github-actions

  • Detected changes to files under the debian directory: debian/libdfm6-search.install

refactor: implement Pimpl pattern for TimeRangeFilter

1. Replaced direct member variables with a private implementation class
(TimeRangeFilterData)
2. Modified constructors, destructor and assignment operators to handle
the pimpl pointer
3. All member access now goes through the d pointer
4. Moved internal RangeMode enum to implementation file
5. Added proper memory management with unique_ptr

Reason for changes:
1. Provides better encapsulation by hiding implementation details
2. Reduces header file dependencies and compilation time
3. Makes ABI more stable since private members aren't exposed
4. Follows better OOP design principles
5. Easier to modify implementation without affecting clients

Influence:
1. Verify all time range filtering functionality works as before
2. Test copy/move operations between TimeRangeFilter instances
3. Check memory usage and leaks when creating/destroying filters
4. Verify range calculations still produce correct results
5. Test with different time fields (modify/created time etc.)

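For readers unfamiliar with the pattern, the general shape of a `unique_ptr`-based pimpl with working copy and move operations is sketched below. The class `RangeFilterSketch` and its single `start` member are hypothetical and much smaller than the real TimeRangeFilter; the "header" and "implementation" halves are shown in one block only to keep the example self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// ---- what would live in the public header ----
class RangeFilterSketch
{
public:
    RangeFilterSketch();
    ~RangeFilterSketch();                       // defined where Data is complete
    RangeFilterSketch(const RangeFilterSketch &other);
    RangeFilterSketch &operator=(const RangeFilterSketch &other);
    RangeFilterSketch(RangeFilterSketch &&other) noexcept;
    RangeFilterSketch &operator=(RangeFilterSketch &&other) noexcept;

    void setStart(std::int64_t secs);
    std::int64_t start() const;

private:
    struct Data;               // only declared here; layout stays hidden
    std::unique_ptr<Data> d;
};

// ---- what would live in the implementation file ----
struct RangeFilterSketch::Data
{
    std::int64_t start = 0;
};

RangeFilterSketch::RangeFilterSketch() : d(std::make_unique<Data>()) {}
RangeFilterSketch::~RangeFilterSketch() = default;

// Copies duplicate the private data; a moved-from source has no data.
RangeFilterSketch::RangeFilterSketch(const RangeFilterSketch &other)
    : d(other.d ? std::make_unique<Data>(*other.d) : std::unique_ptr<Data>())
{
}

RangeFilterSketch &RangeFilterSketch::operator=(const RangeFilterSketch &other)
{
    if (this != &other)
        d = other.d ? std::make_unique<Data>(*other.d) : std::unique_ptr<Data>();
    return *this;
}

RangeFilterSketch::RangeFilterSketch(RangeFilterSketch &&other) noexcept = default;
RangeFilterSketch &RangeFilterSketch::operator=(RangeFilterSketch &&other) noexcept = default;

// Accessors assume the object has not been moved from.
void RangeFilterSketch::setStart(std::int64_t secs) { d->start = secs; }
std::int64_t RangeFilterSketch::start() const { return d->start; }
```

The destructor and move operations are declared in the class but defaulted only after `Data` is complete, because `std::unique_ptr` needs the full type at the point where it may delete the object.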
@deepin-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Johnson-zs

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

  • Detected changes to files under the debian directory: debian/libdfm6-search.install

@github-actions

  • Detected changes to files under the debian directory: debian/control, debian/libdfm6-search.install

update SPDX header

Log:
@github-actions

  • Detected changes to files under the debian directory: debian/control, debian/libdfm6-search.install

@deepin-ci-robot

deepin pr auto review

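I have reviewed the code thoroughly. Below are detailed comments and suggestions covering syntax and logic, code quality, performance, and security:

1. Syntax and logic review

1.1 Time range filtering logic

Problem: In TimeRangeFilter::resolveRelativeTimeRange, the handling of "Last N days" may have a boundary issue.

// Current implementation
case TimeUnit::Days:
    start = QDateTime(now.date().addDays(-value), QTime(0, 0, 0));
    break;

Suggestions

  • Document explicitly whether "Last 3 days" means from 00:00:00 three days ago until now, or from 72 hours ago until now
  • For hour- and minute-level relative times, use exact arithmetic rather than aligning to the start of the day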
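The two readings of "Last N days" raised here differ only in whether the start is aligned to midnight. The sketch below contrasts them using plain epoch seconds in UTC. That is a simplifying assumption: the quoted snippet works on QDateTime in local time, and the function names `lastDaysAligned` and `lastDaysRolling` are made up for the illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t kSecsPerDay = 86400;

// Reading A: start at 00:00:00 (UTC) of the day N days before today.
// Assumes nowSecs is non-negative.
std::int64_t lastDaysAligned(std::int64_t nowSecs, int days)
{
    const std::int64_t todayStart = nowSecs - (nowSecs % kSecsPerDay);
    return todayStart - days * kSecsPerDay;
}

// Reading B: a rolling window of exactly N * 24 hours before now.
std::int64_t lastDaysRolling(std::int64_t nowSecs, int days)
{
    return nowSecs - days * kSecsPerDay;
}
```

At 01:00 on a given day, reading A for three days covers 73 hours while reading B covers exactly 72, which is why the documentation needs to say which one is meant.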

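1.2 Mixed AND query logic in OCR search

Problem: In OcrTextIndexedStrategy::buildAdvancedAndQuery, the mixed AND query logic is fairly complex and may cause performance problems.

// Current implementation
Lucene::BooleanQueryPtr pureFilenameQuery = newLucene<Lucene::BooleanQuery>();
pureFilenameQuery->add(allFilenamesQuery, Lucene::BooleanClause::MUST);
pureFilenameQuery->add(allOcrContentsQuery, Lucene::BooleanClause::MUST_NOT);

Suggestions

  • Consider using Lucene's SpanQuery or PhraseQuery to optimize this kind of complex boolean query
  • Add performance tests, especially for large numbers of documents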

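1.3 Time parsing logic

Problem: In TimeParser::parseTimeRange, time parsing is not tolerant enough of varied input.

// Current implementation
start = QDateTime::fromString(startStr, "yyyy-MM-dd HH:mm");
if (!start.isValid()) {
    start = QDateTime::fromString(startStr, "yyyy-MM-dd");

Suggestions

  • Support more time formats, such as "yyyy/MM/dd" and "MM/dd/yyyy"
  • Support relative times, such as "today" and "yesterday"
  • Consider using the overloads of QDateTime::fromString to support multiple formats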

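2. Code quality review

2.1 Naming conventions

Problem: Some variable and method names are not clear enough.

// Example
auto [start, end] = filter.resolveTimeRange();

Suggestions

  • Use more descriptive variable names, such as startTime and endTime
  • Where multiple values are returned, consider a struct instead of structured bindings to improve readability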

2.2 Error handling

Problem: Some exception handling is incomplete.

// In OcrTextIndexedStrategy::performOcrTextSearch
try {
    doc = searcher->doc(scoreDoc->doc);
    if (!doc) {
        qWarning() << "Failed to retrieve document at index:" << scoreDoc->doc;
        continue;
    }
} catch (const Lucene::LuceneException &e) {
    qWarning() << "Exception while retrieving document:" << QString::fromStdWString(e.getError());
    continue;
}

Suggestions

  • Add more detailed error logging, including stack traces
  • Consider an error counter that aborts the search once errors exceed a threshold
  • For critical errors, consider throwing an exception instead of skipping silently

2.3 Code duplication

Problem: Several ResultAPI classes contain duplicated time-handling code.

// Similar code appears in ContentResultAPI, FileNameResultAPI, and OcrTextResultAPI
qint64 modifyTs = resultAPI.modifyTimestamp();
if (modifyTs > 0) {
    QJsonObject modifyTimeObj;
    modifyTimeObj["timestamp"] = modifyTs;
    modifyTimeObj["formatted"] = resultAPI.modifyTimeString();
    obj["modifyTime"] = modifyTimeObj;
}

Suggestions

  • Extract the time-handling logic into the TimeResultAPI base class
  • Use the template method pattern to reduce duplicated code

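3. Performance review

3.1 String handling

Problem: There are unnecessary string conversions and copies in several places.

// Example
QString path = QString::fromStdWString(pathField);
if (!path.startsWith(searchPath)) {
    continue;
}

Suggestions

  • Consider using QStringView to avoid unnecessary string copies
  • For path comparison, consider the normalization methods of QFileInfo or QDir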

3.2 Query optimization

Problem: Building the Lucene query may be a performance bottleneck.

// In FileNameIndexedStrategy::buildLuceneQuery
if (hasValidQuery && SearchUtility::isFilenameIndexAncestorPathsSupported()
    && SearchUtility::shouldUsePathPrefixQuery(searchPath)) {
    QueryPtr pathPrefixQuery = LuceneQueryUtils::buildPathPrefixQuery(searchPath,
                                                                      QString::fromWCharArray(LuceneFieldNames::FileName::kAncestorPaths));

Suggestions

  • Consider caching the results of path prefix queries
  • Precompile queries for frequently used path prefixes

3.3 Result processing

Problem: Processing search results may have performance problems.

// In ContentIndexedStrategy::processSearchResults
if (enableRetrieval) {
    try {
        Lucene::String contentField = doc->get(LuceneFieldNames::Content::kContents);
        if (!contentField.empty()) {
            const QString content = QString::fromStdWString(contentField);
            const QString highlightedContent = ContentHighlighter::customHighlight(m_keywords, content, previewLen, enableHTML);

Suggestions

  • Consider streaming the content to avoid loading all of it at once
  • For large documents, implement pagination or lazy loading

4. Security review

4.1 Input validation

Problem: Input validation is not strict enough when parsing command line arguments.

// In CliOptions::parse
QString typeStr = m_parser.value(m_typeOption);
if (typeStr == "content") {
    config.searchType = SearchType::Content;
} else if (typeStr == "ocr") {
    config.searchType = SearchType::Ocr;
} else if (typeStr != "filename") {
    std::cerr << "Error: Invalid search type. Use 'filename', 'content', or 'ocr'" << std::endl;
    return false;
}

Suggestions

  • Limit input length to guard against buffer overflows
  • Escape or filter special characters to guard against injection attacks
  • Validate input with an allowlist rather than a denylist

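4.2 Path traversal

Problem: File path handling may be vulnerable to path traversal.

// In several places
QString path = QString::fromStdWString(doc->get(LuceneFieldNames::Content::kPath));
if (!path.startsWith(searchPath)) {
    continue;
}

Suggestions

  • Normalize paths with QFileInfo::canonicalPath()
  • Check whether the path contains ".." or symbolic links
  • Restrict the search path to the allowed directories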

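4.3 Timestamp handling

Problem: Timestamp handling may suffer from integer overflow or loss of precision.

// In several places
qint64 timestamp = QString::fromStdWString(modifyTimeField).toLongLong(&ok);
if (ok && timestamp > 0) {
    resultApi.setModifyTimestamp(timestamp);
}

Suggestions

  • Validate the timestamp range
  • Consider using a safe variant of QDateTime::fromSecsSinceEpoch
  • Add handling for invalid timestamps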

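5. Specific improvement suggestions

5.1 Add unit tests

Suggestions

  • Add more comprehensive unit tests for TimeRangeFilter, especially for boundary conditions
  • Add test cases for the time parser covering the various time formats
  • Add performance tests for the mixed AND query in OCR search

5.2 Improve error handling

Suggestions

  • Implement a unified error handling mechanism
  • Add internationalization support for error codes and error messages
  • Consider implementing an error recovery mechanism

5.3 Optimize performance

Suggestions

  • Implement a query cache
  • Add performance monitoring and logging
  • Consider asynchronous processing to improve responsiveness

5.4 Strengthen security

Suggestions

  • Implement an input validation framework
  • Add security audit logging
  • Consider implementing permission checks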

6. Improved code examples

6.1 Improved time range parsing

// Improved TimeParser::parseTimeRange
bool TimeParser::parseTimeRange(const QString &arg, QDateTime &start, QDateTime &end)
{
    if (arg.isEmpty()) {
        return false;
    }

    const QStringList parts = arg.split(',');
    if (parts.size() != 2) {
        return false;
    }

    const QString startStr = parts[0].trimmed();
    const QString endStr = parts[1].trimmed();

    // Supported formats; the last three are date-only
    const QStringList formats = {
        "yyyy-MM-dd HH:mm:ss",
        "yyyy-MM-dd HH:mm",
        "yyyy-MM-dd",
        "yyyy/MM/dd",
        "MM/dd/yyyy"
    };

    // Parse the start time (date-only input already yields 00:00:00)
    for (const QString &format : formats) {
        start = QDateTime::fromString(startStr, format);
        if (start.isValid()) {
            break;
        }
    }

    // Parse the end time; any date-only input is extended to the end of
    // that day, not just the "yyyy-MM-dd" form
    for (const QString &format : formats) {
        end = QDateTime::fromString(endStr, format);
        if (end.isValid()) {
            if (!format.contains("HH")) {
                end.setTime(QTime(23, 59, 59));
            }
            break;
        }
    }

    // Validate the range
    if (!start.isValid() || !end.isValid() || start > end) {
        return false;
    }

    return true;
}

6.2 Improved error handling

// Improved OcrTextIndexedStrategy::performOcrTextSearch
void OcrTextIndexedStrategy::performOcrTextSearch(const SearchQuery &query)
{
    SearchCancellationGuard guard(&m_cancelled);
    int errorCount = 0;
    const int MAX_ERRORS = 100; // maximum number of errors tolerated

    try {
        // ... existing code ...

        for (int32_t i = 0; i < docsSize; ++i) {
            if (m_cancelled.load()) {
                qInfo() << "OCR text search cancelled";
                break;
            }

            try {
                // ... process the document ...
            } catch (const Lucene::LuceneException &e) {
                errorCount++;
                qWarning() << "Error processing document" << i << ":"
                          << QString::fromStdWString(e.getError());

                if (errorCount >= MAX_ERRORS) {
                    qCritical() << "Too many errors (" << errorCount
                               << "), aborting search";
                    emit errorOccurred(SearchError(OcrTextSearchErrorCode::OcrTextIndexException));
                    return;
                }
                continue;
            }
        }

        if (errorCount > 0) {
            qWarning() << "Search completed with" << errorCount << "errors";
        }

        // ... existing code ...
    } catch (...) {
        qCritical() << "Unexpected exception in OCR text search";
        emit errorOccurred(SearchError(OcrTextSearchErrorCode::OcrTextIndexException));
    }
}

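6.3 Improved path validation

// New path validation utility class
#include <QFileInfo>
#include <QString>
#include <climits>   // PATH_MAX

class PathValidator
{
public:
    static bool isValidSearchPath(const QString &path, const QString &allowedBasePath)
    {
        if (path.isEmpty() || allowedBasePath.isEmpty()) {
            return false;
        }

        QFileInfo pathInfo(path);
        if (!pathInfo.exists() || !pathInfo.isDir()) {
            return false;
        }

        // Canonicalize both paths; this resolves symlinks and ".." segments,
        // so no separate ".." check is needed afterwards
        const QString canonicalPath = pathInfo.canonicalFilePath();
        const QString canonicalBasePath = QFileInfo(allowedBasePath).canonicalFilePath();
        if (canonicalBasePath.isEmpty()) {
            return false;   // the base path does not exist
        }

        // Accept the base itself or anything below it; requiring a separator
        // after the base keeps "/data2" from matching "/data"
        const QString basePrefix = canonicalBasePath.endsWith('/')
                ? canonicalBasePath
                : canonicalBasePath + '/';
        return canonicalPath == canonicalBasePath || canonicalPath.startsWith(basePrefix);
    }

    static bool isPathSafe(const QString &path)
    {
        // Reject paths containing dangerous characters
        if (path.contains('\0') || path.contains('\n') || path.contains('\r')) {
            return false;
        }

        // Reject overlong paths
        if (path.length() > PATH_MAX) {
            return false;
        }

        return true;
    }
};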

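Summary

Overall, the code is well structured and follows object-oriented design principles, but it needs improvement in the following areas:

  1. Time handling logic: needs a more explicit definition of time ranges and more flexible time parsing
  2. Error handling: needs more complete error handling and recovery mechanisms
  3. Performance: query caching, result pagination, and similar optimizations should be considered
  4. Security: input validation and path safety checks need to be strengthened
  5. Code quality: duplicated code should be reduced to improve maintainability

It is recommended to address the security and error-handling issues first, then gradually improve performance and code quality. It is also recommended to add more comprehensive unit and integration tests to ensure the stability and reliability of the code.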

@github-actions

  • Sensitive-word check failed: 1 file contains sensitive words
Details
{
    "debian/control": [
        {
            "line": "Homepage: http://www.deepin.org",
            "line_number": 32,
            "rule": "S35",
            "reason": "Url link | 6fe814dfb7"
        }
    ]
}

@Johnson-zs
Contributor Author

/forcemerge

@deepin-bot

deepin-bot Bot commented Apr 20, 2026

This pr force merged! (status: blocked)

@deepin-bot deepin-bot Bot merged commit dfabe87 into linuxdeepin:master Apr 20, 2026
20 of 22 checks passed
@deepin-bot

deepin-bot Bot commented Apr 20, 2026

TAG Bot

Tag created successfully

📋 Tag Details
  • Tag Name: 1.3.52
  • Tag SHA: 89b79b0b2b61367c735ffd5abfd9dadfe8f63bc4
  • Commit SHA: 4dc6308ef88e376afbb07abf143aec663192a316
  • Tag Message:
    Release util-dfm 1.3.52
    
    
  • Tagger:
    • Name: Johnson-zs
  • Distribution: unstable

2 participants