Skip to content

H-3528, H-3529: Set up Pdfium & preprocess PDFs as images#5512

Merged
TimDiekmann merged 32 commits intohashintel:mainfrom
JesusFileto:chonky
Nov 16, 2024
Merged

H-3528, H-3529: Set up Pdfium & preprocess PDFs as images#5512
TimDiekmann merged 32 commits intohashintel:mainfrom
JesusFileto:chonky

Conversation

@JesusFileto
Copy link
Copy Markdown
Contributor

@JesusFileto JesusFileto commented Oct 29, 2024

🌟 What is the purpose of this PR?

This intial PR serves as the first small step in the segmenting, chunking, and embedding package Chonky. Currently, we are setting up the environment to receive a file path to a pdf, and preprocess the pdf into images using Pdfium-render that will be used for structural embeddings later on.

This PR also sets up the Pdfium-render to be used for text extraction.

🔗 Related links

(internal)
Implementation Doc

🚫 Blocked by

N/A

🔍 What does this change?

  • Loading Pdfs given a local file path
  • Convert Pdfs to page by page images

Pre-Merge Checklist 🚀

🚢 Has this modified a publishable library?

This PR:

  • modifies a Cargo-publishable library, but it is not yet ready to publish

📜 Does this require a change to the docs?

The changes in this PR:

  • are internal and do not require a docs change

🕸️ Does this require a change to the Turbo Graph?

The changes in this PR:

  • do not affect the execution graph

⚠️ Known issues

  • Need for proper implementation and use of error-stack for error handling

🐾 Next steps

  • Extracting textual information from pdfs using pdfium-render and enriching schema with document layout information
  • Implement clap for parsing CLI arguments

🛡 What tests cover this?

  • New tests were created

❓ How to test this?

  • running the command cargo run tests/docs/test-doc.pdf will save the pdf images to the out direction in the Chonky package

📹 Demo

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Oct 29, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added area/libs Relates to first-party libraries/crates/packages (area) area/tests New or updated tests area/libs > chonky Affects the `chonky` crate (library) labels Oct 29, 2024
Comment thread libs/chonky/src/lib.rs Fixed
Comment thread libs/chonky/src/lib.rs Fixed
@TimDiekmann TimDiekmann self-requested a review October 29, 2024 07:43
@vilkinsons vilkinsons changed the title Implement Preprocessing of Pdfs Into Images + Setup Pdfium H-3528, H-3529: Set up Pdfium & preprocess PDFs as images Oct 29, 2024
@vilkinsons vilkinsons requested a review from CiaranMn October 29, 2024 09:14
@github-actions github-actions bot added the area/deps Relates to third-party dependencies (area) label Oct 29, 2024
Comment thread libs/chonky/src/lib.rs Fixed
Comment thread libs/chonky/src/lib.rs Fixed
Comment thread libs/chonky/src/lib.rs Fixed
Comment thread libs/chonky/src/lib.rs Fixed
Comment thread libs/chonky/src/lib.rs Fixed
Comment thread libs/chonky/src/main.rs Fixed
Comment thread libs/chonky/src/main.rs Fixed
Comment thread libs/chonky/src/main.rs Fixed
Comment thread libs/chonky/src/main.rs Fixed
Comment thread libs/chonky/src/main.rs Fixed
@codecov
Copy link
Copy Markdown

codecov bot commented Oct 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 19.83%. Comparing base (24f2260) to head (3e5647c).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5512   +/-   ##
=======================================
  Coverage   19.83%   19.83%           
=======================================
  Files         515      515           
  Lines       17327    17327           
  Branches     2548     2548           
=======================================
  Hits         3437     3437           
  Misses      13852    13852           
  Partials       38       38           
Flag Coverage Δ
apps.hash-ai-worker-ts 1.38% <ø> (ø)
apps.hash-api 1.17% <ø> (ø)
blockprotocol.type-system 47.40% <ø> (ø)
local.hash-backend-utils 8.80% <ø> (ø)
local.hash-graph-sdk 100.00% <ø> (ø)
local.hash-isomorphic-utils 1.05% <ø> (ø)
local.hash-subgraph 24.54% <ø> (ø)
rust.deer 6.66% <ø> (ø)
rust.error-stack 72.51% <ø> (ø)
rust.sarif 87.66% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Copy Markdown
Member

@TimDiekmann TimDiekmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JesusFileto!

I had a look through the PR and so far it really looks good! I have a few minor suggestions, but really nothing critical.

Comment thread libs/chonky/Cargo.toml Outdated
Comment thread libs/chonky/src/lib.rs Outdated
Comment thread libs/chonky/src/lib.rs Outdated
Comment thread libs/chonky/src/lib.rs Outdated
Comment thread libs/chonky/src/main.rs Outdated
Comment thread libs/chonky/src/lib.rs Fixed
Comment thread libs/chonky/src/lib.rs Fixed
Comment thread libs/chonky/src/lib.rs Fixed
Comment thread libs/chonky/src/lib.rs Fixed
Comment thread libs/chonky/src/lib.rs Outdated
Comment thread libs/chonky/src/lib.rs
Comment thread libs/chonky/src/lib.rs Outdated
Comment thread Cargo.toml Outdated
Comment thread Cargo.toml Outdated
Comment thread libs/chonky/src/lib.rs Outdated
Comment thread libs/chonky/libs/libpdfium.dylib Outdated
Copy link
Copy Markdown
Member

@indietyp indietyp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! 👋 I am a contributor to HASH working on a couple of things and got curious when I saw this PR, so I thought I'd leave some comments. Don't feel any pressure to implement any of these, just hoping you'll find these helpful in some way! 😊

Comment thread Cargo.toml Outdated
Comment thread libs/chonky/libs/libpdfium.dylib Outdated
Comment thread libs/chonky/src/lib.rs Outdated
Comment thread libs/chonky/tests/docs/test-doc.pdf
Comment thread libs/chonky/src/lib.rs Outdated
Comment thread libs/chonky/src/lib.rs Outdated
Copy link
Copy Markdown
Member

@TimDiekmann TimDiekmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a minor suggestion on where to put snapshots and added comments to the .github/ files I changed. Having arbitrary folders in src/ might be misleading as typically, every directory inside of src is an actual module.

I think we can safe bigger refactoring such as moving Pdfium to a struct as we discussed offline for a later PR to get this PR over the line.

Comment thread .github/actions/build-docker-images/action.yml
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a step for the test-workflow to download the .so file to link dynamically.

Comment thread libs/chonky/src/snapshots/chonky__tests__page_1.snap
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/deps Relates to third-party dependencies (area) area/infra Relates to version control, CI, CD or IaC (area) area/libs > chonky Affects the `chonky` crate (library) area/libs Relates to first-party libraries/crates/packages (area) area/tests New or updated tests

Development

Successfully merging this pull request may close these issues.

7 participants