Investigate vector based text search #608

@rogerbinns

The goal is to make an alternative to FTS5.

  • Ideally you point it at an FTS5 content table
  • Words are extracted and a vector for each word is determined automatically, i.e. unsupervised and with no online lookups
  • The per-word vectors are stored in that database or another one. They are needed for both ingest and query
  • Existing FTS5 tokenizers such as html, json, and unicodewords can be used
  • Content can also be broken down into sentences: we have the Unicode sentence algorithm and can guess paragraphs
  • It looks like the vector for a sentence is the average of the vectors of the words in it
  • A search should not only find matching documents, but should also find the best sentences in the document
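
A minimal sketch of the sentence averaging and best-sentence ranking described above, not a proposed implementation. It assumes the word vectors already exist and that content has already been tokenized and split into sentences. The names `sentence_vector`, `cosine`, and `best_sentences` are hypothetical, the two-dimensional vectors are hand-made toy data, and cosine similarity is an assumed choice of ranking measure that the issue does not specify.

```python
import math


def sentence_vector(words, word_vectors):
    """Average the vectors of the known words; None if no word is known."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_sentences(query_words, sentences, word_vectors, limit=3):
    """Return up to `limit` (score, sentence_index) pairs, best match first."""
    query = sentence_vector(query_words, word_vectors)
    if query is None:
        return []
    scored = []
    for index, words in enumerate(sentences):
        vector = sentence_vector(words, word_vectors)
        if vector is not None:  # skip sentences with no known words
            scored.append((cosine(query, vector), index))
    # Highest score first; ties broken by sentence order
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:limit]


# Toy data (assumption): real vectors would be learned unsupervised
# from the indexed content and have many more dimensions.
WORD_VECTORS = {
    "bake": [1.0, 0.0],
    "oven": [0.9, 0.1],
    "email": [0.0, 1.0],
    "reply": [0.1, 0.9],
}
SENTENCES = [["bake", "oven"], ["email", "reply"], ["unknown"]]

for score, index in best_sentences(["oven"], SENTENCES, WORD_VECTORS):
    print(f"{score:.3f} {SENTENCES[index]}")
```

Document-level ranking could reuse the same functions by treating the whole document as one long word list, though whether a plain average holds up for long documents is one of the things this investigation would need to check.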

Testing should be done with the SQLite HTML docs, the recipe database, and the Enron emails.
