Skip to content

Add owned PdfString/PdfVec types to fix UB with PdfObject as_string, as_name, as_bytes methods#208

Open
JustForFun88 wants to merge 1 commit intomessense:mainfrom
JustForFun88:new_string_to_fix_ub
Open

Add owned PdfString/PdfVec types to fix UB with PdfObject as_string, as_name, as_bytes methods#208
JustForFun88 wants to merge 1 commit intomessense:mainfrom
JustForFun88:new_string_to_fix_ub

Conversation

@JustForFun88
Copy link
Copy Markdown
Contributor

Closes #207. Return owned types instead of borrowed references:

Before After
as_string() -> &str as_string() -> PdfString
as_name() -> &[u8] as_name() -> PdfVec
as_bytes() -> &[u8] as_bytes() -> PdfVec

I decided to use custom PdfString / PdfVec types instead of String / Vec<u8> because these custom structures allow the following:

  • Inline storage (up to 23 bytes) with no heap allocation, which covers 99% of standard PDF names (Type, Subtype, Font, Rect, ...) and most short string values.
  • Heap allocation uses ecow::EcoVec<u8> which is a reference-counted, copy-on-write vector. Since we do not need to mutate the returned strings, I think that cheap cloning is preferable.

I didn't use existing Rust crates, since most compact string crates (e.g. compact_str, ecow::EcoString, etc.) are designed for Rust strings and cannot provide all of the following at the same time:

  • Inline storage;
  • Cheap &CStr conversion without allocation;
  • Niche optimization.

PdfString / PdfVec are NUL-terminated, so .as_cstr() is just a pointer cast. Additionally, they support niche optimization, so Option<PdfString> / Option<PdfVec> have the same size as the structures themselves.

I also added a helper TryAsCStr trait so functions like get_dict, new_name, and new_string accept not only &str, but also &CStr or PdfString directly. This allows avoiding runtime CString allocations when the caller already has a NUL-terminated value (e.g. passing c"Subtype" instead of "Subtype").

/// Returns `true` if this discriminant indicates heap storage.
#[inline(always)]
const fn is_heap(self) -> bool {
self as u8 == Self::Spilled as u8
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for these as casts? I'm not quite understanding why self == Self::Spilled wouldn't work

#[inline]
fn drop(&mut self) {
if self.is_heap() {
// SAFETY: dara is stored in heap (checked above), and we drop exactly once.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small typo

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces owned, NUL-terminated PdfString/PdfVec primitives (inline up to 23 bytes; heap via ecow::EcoVec) and updates PdfObject::{as_string, as_name, as_bytes} to return owned values to prevent dangling references when MuPDF resolves indirect objects.

Changes:

  • Add pdf::primitives::{PdfVec, PdfString} backed by a shared 24-byte Repr with inline + COW heap storage.
  • Update PdfObject::{as_string, as_name, as_bytes} to return owned PdfString/PdfVec, plus add TryAsCStr and broaden string-taking APIs (new_name, new_string, etc.).
  • Update link/destination parsing and add regression tests around indirect-object deletion.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/pdf/primitives/mod.rs New primitives module; exports PdfString/PdfVec and defines comparison macro + layout asserts.
src/pdf/primitives/repr.rs Implements the shared 24-byte inline/heap representation and &CStr view.
src/pdf/primitives/vector.rs Adds PdfVec owned byte container with as_slice/as_cstr and comparison impls.
src/pdf/primitives/string.rs Adds PdfString wrapper around PdfVec with UTF-8 validation and string conveniences.
src/pdf/primitives/tests.rs Unit tests for layout, inline/heap behavior, and PartialEq interoperability.
src/pdf/object.rs Switches as_* accessors to owned returns; adds TryAsCStr, extends IntoPdfDictKey, adds name_eq.
src/pdf/mod.rs Exposes primitives module and reexports PdfString/PdfVec + new traits.
src/pdf/links/extraction.rs Adapts parsing to new owned return types and new PdfVec/PdfString conversions.
src/pdf/document.rs Broadens APIs to accept TryAsCStr; adds indirect-delete regression tests.
src/pdf/annotation.rs Updates set_author to accept TryAsCStr.
src/destination.rs Adjusts name matching to use PdfVec::as_slice().
Cargo.toml Adds ecow dependency.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +72 to +74
/// Returns the content as a `&CStr` without allocation.
#[inline]
pub fn as_cstr(&self) -> &CStr {
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PdfVec::as_cstr() is a safe API, but PdfVec can be constructed from arbitrary &[u8] (and is now returned from PdfObject::as_bytes()), which may contain interior NUL bytes. The internal implementation uses CStr::from_bytes_with_nul_unchecked, which requires no interior NULs; calling as_cstr()/AsRef<CStr> on such a value would be UB in safe code. Consider either enforcing a “no interior NUL” invariant at construction (e.g., fallible constructors for non-CStr inputs) or making as_cstr/AsRef<CStr> fallible/unsafe and providing a checked conversion instead.

Suggested change
/// Returns the content as a `&CStr` without allocation.
#[inline]
pub fn as_cstr(&self) -> &CStr {
/// Returns the content as a `&CStr` without allocation.
///
/// # Panics
///
/// Panics if the payload contains any interior NUL bytes. This ensures
/// that the underlying representation satisfies the invariants required
/// to be viewed as a C string.
#[inline]
pub fn as_cstr(&self) -> &CStr {
if self.as_slice().contains(&0) {
panic!("PdfVec contains interior NUL bytes; cannot convert to CStr");
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this

impl From<&str> for PdfString {
#[inline]
fn from(s: &str) -> Self {
Self(PdfVec::from(s.as_bytes()))
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PdfString provides a safe as_cstr()/AsRef<CStr>, but From<&str> currently accepts strings with interior NULs. That would violate CStr invariants (the representation uses unchecked CStr construction) and can lead to UB if as_cstr() is called. Consider making construction from &str fallible (reject interior NULs), or making as_cstr() fallible/unsafe so PdfString can safely contain arbitrary UTF-8.

Suggested change
Self(PdfVec::from(s.as_bytes()))
// Ensure there are no interior NUL bytes and that the stored bytes are NUL-terminated.
let cstring = CString::new(s).expect("PdfString cannot contain interior NUL bytes");
Self(PdfVec::from(cstring.as_bytes_with_nul()))

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this comment

Comment thread src/pdf/document.rs
Comment on lines +913 to +942
#[test]
fn test_as_string_indirect_delete() {
let mut pdf = PdfDocument::new();

let original = "A".repeat(1000);

// 1. Create a direct string object (refcount=1)
let string_obj = PdfObject::new_string(&original).unwrap();

// 2. Add to document xref — returns an indirect reference.
// Internally: pdf_update_object keeps the direct obj (refcount=2),
// and returns a new indirect ref.
let indirect = pdf.add_object(&string_obj).unwrap();
assert!(indirect.is_indirect().unwrap());
let obj_num = indirect.as_indirect().unwrap();

// 3. Get string from the indirect ref.
let prd_str = indirect.as_string().unwrap();
assert_eq!(prd_str, original);

// 4. Drop the original direct PdfObject (refcount drops to 1)
// (only the xref entry still holds a reference)
drop(string_obj);

// 5. Delete the xref entry (refcount drops to 0),
// object and its text buffer are freed.
pdf.delete_object(obj_num).unwrap();

assert_eq!(prd_str, original);
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These indirect-delete regression tests don’t currently try to force allocator reuse after delete_object(). With the old borrowed-return behavior, they could still pass if freed memory happens to remain unchanged, making the tests less effective at catching regressions. Consider adding some churn (e.g., allocate/insert multiple large strings/names after deletion, similar to the reproducer in #207) so the tests would reliably fail if a dangling borrow is reintroduced.

Copilot uses AI. Check for mistakes.
Comment on lines +232 to +233
// SAFETY: dara is stored in heap (checked above).
let heap = unsafe { self.as_heap() };
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in safety comment: "dara" should be "data" (also appears in a few nearby safety comments).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

@itsjunetime itsjunetime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do appreciate the work that's gone into implementing this and ensuring it has all these very desirable properties. It's cool to see.

However, I don't think we should merge this as-is. There are too many potential vectors for unsoundness, which is not super great especially since we're trying to add this to avoid unsoundness.

Imo, structs like this are generally best off living in their own crate, where they are more easily auditable and analyzable. I think putting this sort of CStr representation in its own crate would also just be nice for the ecosystem - I doubt mupdf is the first crate to want a nicely-optimized small c string.

I also think that this would be better off using some sort of property testing (running under miri, ofc) to suss out other instances of potential unsoundness. A simple proptest that just generates nonsense strings and vecs of bytes, passes them through a few different functions that we've added here, and validates that they come out the other end looking exactly the same, would be a nice way to feel more confident about this.

///
/// `Spilled` (= 24) is impossible for any inline variant because the remainder can never exceed 23.
/// This unused value becomes the heap discriminant and provides a niche for `Option` optimization.
#[allow(dead_code)]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the allow(dead_code)? Isn't this used just above?

pub(super) struct HeapRepr {
vector: ManuallyDrop<EcoVec<u8>>,
_pad: [u8; PAD_LENGTH],
last: LastByte, // always LastByte::Spilled
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's always spilled, why don't we just make something like

#[repr(u8)]
enum SpilledLastByte {
    Spilled = 24
}

to guarantee it's always spilled?

#[inline]
fn new(vector: EcoVec<u8>) -> Self {
Self {
vector: ManuallyDrop::new(vector),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this function can cause unsoundness - if the EcoVec doesn't have a null terminator, then some of the other functions (such as as_cstr()) which assume that it does have a null terminator, may read past the end of the allocation. And this function doesn't guarantee that the provided vector has a null terminator.

#[inline]
const unsafe fn as_heap(&self) -> &HeapRepr {
debug_assert!(self.is_heap());
unsafe { &*(self as *const Self).cast::<HeapRepr>() }
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty evident that it's fine because self.is_heap() and you're trusting the caller, but I think it would still be nice to try to have a SAFETY comment per-unsafe block. This also applies to the as_mut_heap fn below.

Comment on lines +283 to +289
/// # Safety
/// Must only be called when `is_heap()`.
#[inline]
unsafe fn as_mut_heap(&mut self) -> &mut HeapRepr {
debug_assert!(self.is_heap());
unsafe { &mut *(self as *mut Self).cast::<HeapRepr>() }
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function needs both:

  1. More documentation to make it clear what you're allowed to modify within the &mut HeapRepr, since modifying the last_byte or replacing the EcoVec could cause unsoundness
  2. A change to the last_byte within the HeapRepr to only have a single variant with the of value 24, so that there's no way someone could modify the last_byte to cause unsoundness.

Comment on lines +362 to +363
const _: () = assert!(size_of::<Repr>() == INLINE_TOTAL);
const _: () = assert!(size_of::<HeapRepr>() == INLINE_TOTAL);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something but I can't figure out how these hold. Repr has INLINE_TOTAL u8s + a single other byte - it should be INLINE_TOTAL + 1 bytes, afaict. It's not a union; they're sequential in memory. In a fairly similar rust playground, this condition doesn't hold. What am I missing?

impl From<&str> for PdfString {
#[inline]
fn from(s: &str) -> Self {
Self(PdfVec::from(s.as_bytes()))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this comment

/// Returns the payload length in bytes (without the NUL terminator).
#[inline]
pub fn len(&self) -> usize {
self.as_slice().len()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we could probably get the length more quickly, right? Just check the last_byte and if it's spilled, then query the ecowvec?

Comment on lines +72 to +74
/// Returns the content as a `&CStr` without allocation.
#[inline]
pub fn as_cstr(&self) -> &CStr {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this


/// Layout for the heap variant.
///
/// The `EcoVec<u8>` always stores the payload with a trailing NUL byte.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// The `EcoVec<u8>` always stores the payload with a trailing NUL byte.
/// Whenever this is constructed, we make sure to push a trailing NUL byte to the
/// end of the `EcoVec<u8>`. It is unsound to construct this without the trailing
/// NUL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UB with PdfObject as_string, as_name, as_bytes methods

4 participants