H-3528, H-3529: Set up Pdfium & preprocess PDFs as images #5512
TimDiekmann merged 32 commits into hashintel:main
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5512   +/-   ##
=======================================
  Coverage   19.83%   19.83%
=======================================
  Files         515      515
  Lines       17327    17327
  Branches     2548     2548
=======================================
  Hits         3437     3437
  Misses      13852    13852
  Partials       38       38
Flags with carried forward coverage won't be shown.
TimDiekmann left a comment:
Thanks @JesusFileto!
I had a look through the PR and so far it really looks good! I have a few minor suggestions, but really nothing critical.
indietyp left a comment:
Hi! 👋 I am a contributor to HASH working on a couple of things and got curious when I saw this PR, so I thought I'd leave some comments. Don't feel any pressure to implement any of these, just hoping you'll find these helpful in some way! 😊
TimDiekmann left a comment:
I have a minor suggestion on where to put snapshots and added comments to the .github/ files I changed. Having arbitrary folders in src/ might be misleading as typically, every directory inside of src is an actual module.
I think we can save bigger refactorings, such as moving Pdfium to a struct as we discussed offline, for a later PR to get this PR over the line.
I added a step for the test-workflow to download the .so file to link dynamically.
🌟 What is the purpose of this PR?
This initial PR is the first small step toward Chonky, the segmenting, chunking, and embedding package. Currently, we are setting up the environment to receive a file path to a PDF and preprocess the PDF into images using pdfium-render; these images will be used for structural embeddings later on. This PR also sets up pdfium-render to be used for text extraction.
🔗 Related links
(internal)
Implementation Doc
🚫 Blocked by
N/A
🔍 What does this change?
Pre-Merge Checklist 🚀
🚢 Has this modified a publishable library?
This PR:
📜 Does this require a change to the docs?
The changes in this PR:
🕸️ Does this require a change to the Turbo Graph?
The changes in this PR:
error-stack for error handling
🐾 Next steps
pdfium-render and enriching schema with document layout information
clap for parsing CLI arguments
🛡 What tests cover this?
❓ How to test this?
cargo run tests/docs/test-doc.pdf will save the PDF images to the out directory in the Chonky package
📹 Demo
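The command-line flow described under "How to test this?" (one PDF path in, one image per page written to the out directory) might be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from this PR: the function names, the `chonky` binary name in the usage message, the PNG extension, and the `<stem>-page-<n>.png` naming scheme are all assumptions made for the example, and the actual rendering via pdfium-render is left out.

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

/// Extracts the PDF path from the process arguments.
/// The first item is the program name; exactly one further argument is expected.
/// (Hypothetical helper; the PR lists `clap` as a next step for real argument parsing.)
fn pdf_path_from_args<I>(args: I) -> Result<PathBuf, String>
where
    I: IntoIterator<Item = OsString>,
{
    let mut args = args.into_iter().skip(1);
    let path = args
        .next()
        .ok_or_else(|| "usage: chonky <path-to-pdf>".to_string())?;
    if args.next().is_some() {
        return Err("expected exactly one argument".to_string());
    }
    Ok(PathBuf::from(path))
}

/// Builds the output path for one rendered page.
/// Assumed naming scheme: <out_dir>/<pdf-stem>-page-<n>.png, with n starting at 1.
fn page_image_path(out_dir: &Path, pdf: &Path, page_index: usize) -> PathBuf {
    let stem = pdf
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("document");
    out_dir.join(format!("{}-page-{}.png", stem, page_index + 1))
}
```

In a `main` function, one would pass `std::env::args_os()` to `pdf_path_from_args`, create the output directory with `std::fs::create_dir_all`, and then call `page_image_path` once per rendered page to decide where each image is saved.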