Undefined Behavior in Unicode Numeric Parsing impl_from_formatted_str #52

@shinmao

Hi, we are security researchers from SunLab focusing on Rust. We found that the parsing functionality can trigger undefined behavior (UB) when it processes Unicode numeric characters.

    if c.is_numeric() {
        if index > BUF_LEN {
            return Err(Error::parse_number(&s));
        }
        buf[index] = c as u8;
        index += 1;
    }
}
if index == 0 {
    return Err(Error::parse_number(&s));
}
let s2 = unsafe { str::from_utf8_unchecked(&buf[..index]) };

The vulnerability is in the parsing.rs excerpt above: the code mishandles non-ASCII numeric characters and then builds a string from the resulting bytes, which may not be valid UTF-8, through the unsafe function str::from_utf8_unchecked.

The code at line 94 uses c.is_numeric(), which accepts all Unicode numeric characters, not just the ASCII digits 0-9. At line 98, each accepted character is truncated to a single byte with c as u8, which keeps only the low 8 bits of the code point and discards the rest. At line 107, the buffer, which can now contain bytes that are not valid UTF-8, is turned into a &str via from_utf8_unchecked(), which assumes valid UTF-8 without verifying it. This violates the function's safety contract and constitutes UB. The documentation for from_utf8_unchecked states the requirement as follows:

"The bytes passed in must be valid UTF-8." (Safety section of the str::from_utf8_unchecked documentation)
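
The truncation step can be checked in isolation using only the standard library. The sketch below is not the crate's code: `truncate_like_parser` and `truncated_is_valid_utf8` are hypothetical helpers that mirror the `c as u8` cast from the excerpt, and validation goes through the checked `std::str::from_utf8` so the example itself stays free of UB.

```rust
/// Hypothetical helper mirroring the `c as u8` cast in the excerpt:
/// keeps only the low 8 bits of the code point.
fn truncate_like_parser(c: char) -> u8 {
    c as u8
}

/// Returns true if the single truncated byte is valid UTF-8 on its own.
fn truncated_is_valid_utf8(c: char) -> bool {
    std::str::from_utf8(&[truncate_like_parser(c)]).is_ok()
}

fn main() {
    // One ASCII digit for comparison, then three non-ASCII numeric characters.
    for c in ['7', '𝟘', '①', '½'] {
        println!(
            "{:?}: is_numeric={} is_ascii_digit={} truncated=0x{:02X} valid_utf8={}",
            c,
            c.is_numeric(),
            c.is_ascii_digit(),
            truncate_like_parser(c),
            truncated_is_valid_utf8(c),
        );
    }
}
```

All three non-ASCII characters pass `is_numeric()` but fail `is_ascii_digit()`. `𝟘` and `½` truncate to 0xD8 and 0xBD, neither of which is valid UTF-8 on its own. `①` truncates to 0x60, an ASCII backtick, so that input yields a well-formed but meaningless byte rather than invalid UTF-8.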

Proof of Concept on Invalid UTF-8 Generation

use num_format::Locale;
use num_format::parsing::ParseFormatted;

fn main() {
    let test_cases = vec![
        ("𝟘", "U+1D7D8", "MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO"),
        ("①", "U+2460", "CIRCLED DIGIT ONE"),
        ("½", "U+00BD", "VULGAR FRACTION ONE HALF"),
    ];

    for (input, unicode, description) in test_cases {
        println!("Testing: {} ({}, {})", input, unicode, description);
        
        let c = input.chars().next().unwrap();
        let truncated = c as u8;
        println!("  Codepoint: U+{:04X}", c as u32);
        println!("  Truncated to: 0x{:02X}", truncated);
        
        match std::str::from_utf8(&[truncated]) {
            Ok(_) => println!("  Valid UTF-8"),
            Err(_) => println!("  INVALID UTF-8 - Will cause UB!"),
        }
        
        match input.parse_formatted::<_, u32>(&Locale::en) {
            Ok(n) => println!("  Parsed: {}", n),
            Err(e) => println!("  Error: {}", e),
        }
        println!();
    }
}

Output:

Testing: 𝟘 (U+1D7D8, MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO)
  Codepoint: U+1D7D8
  Truncated to: 0xD8
  INVALID UTF-8 - Will cause UB!
  Error: Failed to parse 𝟘 into a valid locale.

To make the parser sound, the accepted input can be restricted to ASCII digits (0-9).
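
One possible shape for such a fix is sketched below. This is an illustration under stated assumptions, not a patch against the crate: `collect_ascii_digits` is a hypothetical stand-in for the digit-collection step, the `BUF_LEN` value is made up, and the sketch returns an `Option<String>` instead of the crate's error type. It uses `char::is_ascii_digit` so the `c as u8` cast is lossless, and the checked `std::str::from_utf8` so no `unsafe` is needed.

```rust
/// Assumed buffer capacity for this sketch; the crate's real `BUF_LEN` may differ.
const BUF_LEN: usize = 64;

/// Hypothetical stand-in for the digit-collection step: copies only ASCII
/// digits from `s` into a fixed buffer and returns them as an owned String.
/// Returns None if there are no digits or more than BUF_LEN of them.
fn collect_ascii_digits(s: &str) -> Option<String> {
    let mut buf = [0u8; BUF_LEN];
    let mut index = 0;
    for c in s.chars() {
        // `is_ascii_digit` accepts only '0'..='9', so `c as u8` loses nothing.
        if c.is_ascii_digit() {
            // `>=` keeps `buf[index]` in bounds for a buffer of length BUF_LEN.
            if index >= BUF_LEN {
                return None;
            }
            buf[index] = c as u8;
            index += 1;
        }
    }
    if index == 0 {
        return None;
    }
    // Checked conversion: it cannot fail for ASCII digits, and it avoids `unsafe`.
    std::str::from_utf8(&buf[..index]).ok().map(String::from)
}

fn main() {
    // ASCII digits are kept; the separator is skipped.
    println!("{:?}", collect_ascii_digits("1,234")); // prints Some("1234")
    // Non-ASCII numerics are skipped instead of being truncated.
    println!("{:?}", collect_ascii_digits("𝟘①½")); // prints None
}
```

Whether non-ASCII numeric characters should be silently skipped, as in this sketch, or rejected with a parse error is a design decision for the maintainers.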

Thanks for reading. Let us know if you have any questions about this report.
