The goal is to make an alternative to FTS5.
- Ideally you point it at an FTS5 content table
- Words are extracted and a vector for each word is determined automatically, i.e. unsupervised and with no online lookups
- The vector for each word is stored in that database or another one. The vectors are needed for both ingest and query
- Existing FTS5 tokenizers such as html, json, and unicodewords can be used
- Content can also be broken down into sentences: we have the Unicode sentence algorithm and can guess paragraphs
- It looks like the vector for a sentence is the average of the vectors of the words in it
- A search should not only find matching documents, but should also find the best sentences in the document
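
A minimal sketch of the sentence-averaging and best-sentence ideas in the bullets above. It is an illustration under assumptions, not the intended implementation: the tiny `WORD_VECTORS` table is made up, the helper names (`sentence_vector`, `cosine`, `best_sentences`) are hypothetical, sentences are assumed to be already tokenized, and cosine similarity is an assumed choice of ranking measure.

```python
import math

# Hypothetical toy vectors. In the design above these would be determined
# unsupervised from the content and stored in a database.
WORD_VECTORS = {
    "bake": [0.9, 0.1, 0.0],
    "bread": [0.8, 0.2, 0.1],
    "oven": [0.7, 0.3, 0.0],
    "send": [0.0, 0.9, 0.2],
    "email": [0.1, 0.8, 0.3],
}


def sentence_vector(words, vectors=WORD_VECTORS):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_sentences(query_words, sentences, top_n=3):
    """Return up to top_n (score, sentence) pairs, highest score first."""
    query_vec = sentence_vector(query_words)
    if query_vec is None:
        return []
    scored = []
    for sentence in sentences:
        vec = sentence_vector(sentence)
        if vec is not None:
            scored.append((cosine(query_vec, vec), sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    doc = [["send", "email"], ["bake", "bread", "oven"]]
    for score, sentence in best_sentences(["bread"], doc):
        print(round(score, 3), " ".join(sentence))
```

The same query vector is built with the same averaging step used at ingest, which is why the per-word vectors need to be available at both times.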
Testing should be done with the SQLite HTML docs, the recipe database, and the Enron emails.