Add owned PdfString/PdfVec types to fix UB with PdfObject as_string, as_name, as_bytes methods#208
Add owned PdfString/PdfVec types to fix UB with PdfObject as_string, as_name, as_bytes methods#208JustForFun88 wants to merge 1 commit intomessense:mainfrom
PdfObject as_string, as_name, as_bytes methods#208Conversation
| /// Returns `true` if this discriminant indicates heap storage. | ||
| #[inline(always)] | ||
| const fn is_heap(self) -> bool { | ||
| self as u8 == Self::Spilled as u8 |
There was a problem hiding this comment.
Is there a reason for these as casts? I'm not quite understanding why self == Self::Spilled wouldn't work
| #[inline] | ||
| fn drop(&mut self) { | ||
| if self.is_heap() { | ||
| // SAFETY: dara is stored in heap (checked above), and we drop exactly once. |
There was a problem hiding this comment.
Pull request overview
Introduces owned, NUL-terminated PdfString/PdfVec primitives (inline up to 23 bytes; heap via ecow::EcoVec) and updates PdfObject::{as_string, as_name, as_bytes} to return owned values to prevent dangling references when MuPDF resolves indirect objects.
Changes:
- Add
pdf::primitives::{PdfVec, PdfString}backed by a shared 24-byteReprwith inline + COW heap storage. - Update
PdfObject::{as_string, as_name, as_bytes}to return ownedPdfString/PdfVec, plus addTryAsCStrand broaden string-taking APIs (new_name,new_string, etc.). - Update link/destination parsing and add regression tests around indirect-object deletion.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/pdf/primitives/mod.rs |
New primitives module; exports PdfString/PdfVec and defines comparison macro + layout asserts. |
src/pdf/primitives/repr.rs |
Implements the shared 24-byte inline/heap representation and &CStr view. |
src/pdf/primitives/vector.rs |
Adds PdfVec owned byte container with as_slice/as_cstr and comparison impls. |
src/pdf/primitives/string.rs |
Adds PdfString wrapper around PdfVec with UTF-8 validation and string conveniences. |
src/pdf/primitives/tests.rs |
Unit tests for layout, inline/heap behavior, and PartialEq interoperability. |
src/pdf/object.rs |
Switches as_* accessors to owned returns; adds TryAsCStr, extends IntoPdfDictKey, adds name_eq. |
src/pdf/mod.rs |
Exposes primitives module and reexports PdfString/PdfVec + new traits. |
src/pdf/links/extraction.rs |
Adapts parsing to new owned return types and new PdfVec/PdfString conversions. |
src/pdf/document.rs |
Broadens APIs to accept TryAsCStr; adds indirect-delete regression tests. |
src/pdf/annotation.rs |
Updates set_author to accept TryAsCStr. |
src/destination.rs |
Adjusts name matching to use PdfVec::as_slice(). |
Cargo.toml |
Adds ecow dependency. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// Returns the content as a `&CStr` without allocation. | ||
| #[inline] | ||
| pub fn as_cstr(&self) -> &CStr { |
There was a problem hiding this comment.
PdfVec::as_cstr() is a safe API, but PdfVec can be constructed from arbitrary &[u8] (and is now returned from PdfObject::as_bytes()), which may contain interior NUL bytes. The internal implementation uses CStr::from_bytes_with_nul_unchecked, which requires no interior NULs; calling as_cstr()/AsRef<CStr> on such a value would be UB in safe code. Consider either enforcing a “no interior NUL” invariant at construction (e.g., fallible constructors for non-CStr inputs) or making as_cstr/AsRef<CStr> fallible/unsafe and providing a checked conversion instead.
| /// Returns the content as a `&CStr` without allocation. | |
| #[inline] | |
| pub fn as_cstr(&self) -> &CStr { | |
| /// Returns the content as a `&CStr` without allocation. | |
| /// | |
| /// # Panics | |
| /// | |
| /// Panics if the payload contains any interior NUL bytes. This ensures | |
| /// that the underlying representation satisfies the invariants required | |
| /// to be viewed as a C string. | |
| #[inline] | |
| pub fn as_cstr(&self) -> &CStr { | |
| if self.as_slice().contains(&0) { | |
| panic!("PdfVec contains interior NUL bytes; cannot convert to CStr"); | |
| } |
| impl From<&str> for PdfString { | ||
| #[inline] | ||
| fn from(s: &str) -> Self { | ||
| Self(PdfVec::from(s.as_bytes())) |
There was a problem hiding this comment.
PdfString provides a safe as_cstr()/AsRef<CStr>, but From<&str> currently accepts strings with interior NULs. That would violate CStr invariants (the representation uses unchecked CStr construction) and can lead to UB if as_cstr() is called. Consider making construction from &str fallible (reject interior NULs), or making as_cstr() fallible/unsafe so PdfString can safely contain arbitrary UTF-8.
| Self(PdfVec::from(s.as_bytes())) | |
| // Ensure there are no interior NUL bytes and that the stored bytes are NUL-terminated. | |
| let cstring = CString::new(s).expect("PdfString cannot contain interior NUL bytes"); | |
| Self(PdfVec::from(cstring.as_bytes_with_nul())) |
There was a problem hiding this comment.
I agree with this comment
| #[test] | ||
| fn test_as_string_indirect_delete() { | ||
| let mut pdf = PdfDocument::new(); | ||
|
|
||
| let original = "A".repeat(1000); | ||
|
|
||
| // 1. Create a direct string object (refcount=1) | ||
| let string_obj = PdfObject::new_string(&original).unwrap(); | ||
|
|
||
| // 2. Add to document xref — returns an indirect reference. | ||
| // Internally: pdf_update_object keeps the direct obj (refcount=2), | ||
| // and returns a new indirect ref. | ||
| let indirect = pdf.add_object(&string_obj).unwrap(); | ||
| assert!(indirect.is_indirect().unwrap()); | ||
| let obj_num = indirect.as_indirect().unwrap(); | ||
|
|
||
| // 3. Get string from the indirect ref. | ||
| let prd_str = indirect.as_string().unwrap(); | ||
| assert_eq!(prd_str, original); | ||
|
|
||
| // 4. Drop the original direct PdfObject (refcount drops to 1) | ||
| // (only the xref entry still holds a reference) | ||
| drop(string_obj); | ||
|
|
||
| // 5. Delete the xref entry (refcount drops to 0), | ||
| // object and its text buffer are freed. | ||
| pdf.delete_object(obj_num).unwrap(); | ||
|
|
||
| assert_eq!(prd_str, original); | ||
| } |
There was a problem hiding this comment.
These indirect-delete regression tests don’t currently try to force allocator reuse after delete_object(). With the old borrowed-return behavior, they could still pass if freed memory happens to remain unchanged, making the tests less effective at catching regressions. Consider adding some churn (e.g., allocate/insert multiple large strings/names after deletion, similar to the reproducer in #207) so the tests would reliably fail if a dangling borrow is reintroduced.
| // SAFETY: dara is stored in heap (checked above). | ||
| let heap = unsafe { self.as_heap() }; |
There was a problem hiding this comment.
Typo in safety comment: "dara" should be "data" (also appears in a few nearby safety comments).
itsjunetime
left a comment
There was a problem hiding this comment.
I do appreciate the work that's gone into implementing this and ensuring it has all these very desirable properties. It's cool to see.
However, I don't think we should merge this as-is. There are too many potential vectors for unsoundness, which is not super great especially since we're trying to add this to avoid unsoundness.
Imo, structs like this are generally best off living in their own crate, where they are more easily auditable and analyzable. I think putting this sort of CStr representation in its own crate would also just be nice for the ecosystem - I doubt mupdf is the first crate to want a nicely-optimized small c string.
I also think that this would be better off using some sort of property testing (running under miri, ofc) to suss out other instances of potential unsoundness. A simple proptest that just generates nonsense strings and vecs of bytes, passes them through a few different functions that we've added here, and validates that they come out the other end looking exactly the same, would be a nice way to feel more confident about this.
| /// | ||
| /// `Spilled` (= 24) is impossible for any inline variant because the remainder can never exceed 23. | ||
| /// This unused value becomes the heap discriminant and provides a niche for `Option` optimization. | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
Why the allow(dead_code)? Isn't this used just above?
| pub(super) struct HeapRepr { | ||
| vector: ManuallyDrop<EcoVec<u8>>, | ||
| _pad: [u8; PAD_LENGTH], | ||
| last: LastByte, // always LastByte::Spilled |
There was a problem hiding this comment.
If it's always spilled, why don't we just make something like
#[repr(u8)]
enum SpilledLastByte {
Spilled = 24
}to guarantee it's always spilled?
| #[inline] | ||
| fn new(vector: EcoVec<u8>) -> Self { | ||
| Self { | ||
| vector: ManuallyDrop::new(vector), |
There was a problem hiding this comment.
I think that this function can cause unsoundness - if the EcoVec doesn't have a null terminator, then some of the other functions (such as as_cstr()) which assume that it does have a null terminator, may read past the end of the allocation. And this function doesn't guarantee that the provided vector has a null terminator.
| #[inline] | ||
| const unsafe fn as_heap(&self) -> &HeapRepr { | ||
| debug_assert!(self.is_heap()); | ||
| unsafe { &*(self as *const Self).cast::<HeapRepr>() } |
There was a problem hiding this comment.
It's pretty evident that it's fine because self.is_heap() and you're trusting the caller, but I think it would still be nice to try to have a SAFETY comment per-unsafe block. This also applies to the as_mut_heap fn below.
| /// # Safety | ||
| /// Must only be called when `is_heap()`. | ||
| #[inline] | ||
| unsafe fn as_mut_heap(&mut self) -> &mut HeapRepr { | ||
| debug_assert!(self.is_heap()); | ||
| unsafe { &mut *(self as *mut Self).cast::<HeapRepr>() } | ||
| } |
There was a problem hiding this comment.
I think this function needs both:
- More documentation to make it clear what you're allowed to modify within the
&mut HeapRepr, since modifying thelast_byteor replacing theEcoVeccould cause unsoundness - A change to the
last_bytewithin theHeapReprto only have a single variant with the of value24, so that there's no way someone could modify thelast_byteto cause unsoundness.
| const _: () = assert!(size_of::<Repr>() == INLINE_TOTAL); | ||
| const _: () = assert!(size_of::<HeapRepr>() == INLINE_TOTAL); |
There was a problem hiding this comment.
Maybe I'm missing something but I can't figure out how these hold. Repr has INLINE_TOTAL u8s + a single other byte - it should be INLINE_TOTAL + 1 bytes, afaict. It's not a union; they're sequential in memory. In a fairly similar rust playground, this condition doesn't hold. What am I missing?
| impl From<&str> for PdfString { | ||
| #[inline] | ||
| fn from(s: &str) -> Self { | ||
| Self(PdfVec::from(s.as_bytes())) |
There was a problem hiding this comment.
I agree with this comment
| /// Returns the payload length in bytes (without the NUL terminator). | ||
| #[inline] | ||
| pub fn len(&self) -> usize { | ||
| self.as_slice().len() |
There was a problem hiding this comment.
I feel like we could probably get the length more quickly, right? Just check the last_byte and if it's spilled, then query the ecowvec?
| /// Returns the content as a `&CStr` without allocation. | ||
| #[inline] | ||
| pub fn as_cstr(&self) -> &CStr { |
|
|
||
| /// Layout for the heap variant. | ||
| /// | ||
| /// The `EcoVec<u8>` always stores the payload with a trailing NUL byte. |
There was a problem hiding this comment.
| /// The `EcoVec<u8>` always stores the payload with a trailing NUL byte. | |
| /// Whenever this is constructed, we make sure to push a trailing NUL byte to the | |
| /// end of the `EcoVec<u8>`. It is unsound to construct this without the trailing | |
| /// NUL |
Closes #207. Return owned types instead of borrowed references:
as_string() -> &stras_string() -> PdfStringas_name() -> &[u8]as_name() -> PdfVecas_bytes() -> &[u8]as_bytes() -> PdfVecI decided to use custom
PdfString/PdfVectypes instead ofString/Vec<u8>because these custom structures allow the following:Type,Subtype,Font,Rect, ...) and most short string values.ecow::EcoVec<u8>which is a reference-counted, copy-on-write vector. Since we do not need to mutate the returned strings, I think that cheap cloning is preferable.I didn't use existing Rust crates, since most compact string crates (e.g.
compact_str,ecow::EcoString, etc.) are designed for Rust strings and cannot provide all of the following at the same time:&CStrconversion without allocation;PdfString/PdfVecare NUL-terminated, so.as_cstr()is just a pointer cast. Additionally, they support niche optimization, soOption<PdfString>/Option<PdfVec>have the same size as the structures themselves.I also added a helper
TryAsCStrtrait so functions likeget_dict,new_name, andnew_stringaccept not only&str, but also&CStrorPdfStringdirectly. This allows avoiding runtimeCStringallocations when the caller already has a NUL-terminated value (e.g. passingc"Subtype"instead of"Subtype").