Undefined Behavior in Unicode Numeric Parsing `impl_from_formatted_str`

Hi, we are security researchers from [SunLab](https://github.com/SunLab-GMU) focusing on Rust. We found that the parsing functionality can trigger **undefined behavior (UB)** when it processes non-ASCII Unicode numeric characters.

https://github.com/bcmyers/num-format/blob/c2173715e17a48de2e0b453972715f78a8a7594b/num-format/src/parsing.rs#L94-L107

The vulnerability is in the `parsing.rs` code linked above: it mishandles non-ASCII numeric characters and can build an invalid UTF-8 string through the unsafe call `str::from_utf8_unchecked`.

At line 94, the code calls `c.is_numeric()`, which accepts **all Unicode numeric characters**, not just the ASCII digits (0-9). At line 98, each accepted character is truncated to a `u8` with `c as u8`, which discards the high bits of the code point. At line 107, the buffer, which may now hold invalid UTF-8 bytes, is turned into a `&str` via `from_utf8_unchecked()`, which **assumes valid UTF-8** without verification. This violates the function's safety contract and constitutes UB. The documentation for `from_utf8_unchecked` states:

> The bytes passed in must be valid UTF-8.
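
To make the gap between `is_numeric()` and ASCII digits concrete, here is a minimal standalone sketch using only the standard library. The helper names `is_non_ascii_numeric` and `truncated_byte_is_valid_utf8` are ours, introduced for illustration; they are not part of num-format.

```rust
/// Hypothetical helper: true if `c` passes `char::is_numeric` but is not an
/// ASCII digit, i.e. the class of characters this report is about.
fn is_non_ascii_numeric(c: char) -> bool {
    c.is_numeric() && !c.is_ascii_digit()
}

/// Hypothetical helper: truncates `c` to its low byte (as `c as u8` does)
/// and reports whether that single byte is valid UTF-8 on its own.
fn truncated_byte_is_valid_utf8(c: char) -> bool {
    std::str::from_utf8(&[c as u8]).is_ok()
}

fn main() {
    for c in "7①½𝟘".chars() {
        println!(
            "{:?}: is_numeric={}, is_ascii_digit={}, low byte=0x{:02X}, valid alone={}",
            c,
            c.is_numeric(),
            c.is_ascii_digit(),
            c as u8,
            truncated_byte_is_valid_utf8(c)
        );
    }
}
```

Note that truncation does not always yield invalid UTF-8: `①` (U+2460) truncates to `0x60`, which is a valid ASCII byte but not a digit. So besides the UB cases, some inputs would silently put unrelated ASCII characters into the buffer.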

## Proof of Concept on Invalid UTF-8 Generation
```rust
use num_format::Locale;
use num_format::parsing::ParseFormatted;

fn main() {
    let test_cases = vec![
        ("𝟘", "U+1D7D8", "MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO"),
        ("①", "U+2460", "CIRCLED DIGIT ONE"),
        ("½", "U+00BD", "VULGAR FRACTION ONE HALF"),
    ];

    for (input, unicode, description) in test_cases {
        println!("Testing: {} ({}, {})", input, unicode, description);
        
        let c = input.chars().next().unwrap();
        let truncated = c as u8;
        println!("  Codepoint: U+{:04X}", c as u32);
        println!("  Truncated to: 0x{:02X}", truncated);
        
        match std::str::from_utf8(&[truncated]) {
            Ok(_) => println!("  Valid UTF-8"),
            Err(_) => println!("  INVALID UTF-8 - Will cause UB!"),
        }
        
        match input.parse_formatted::<_, u32>(&Locale::en) {
            Ok(n) => println!("  Parsed: {}", n),
            Err(e) => println!("  Error: {}", e),
        }
        println!();
    }
}
```

**Output:**
```
Testing: 𝟘 (U+1D7D8, MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO)
  Codepoint: U+1D7D8
  Truncated to: 0xD8
  INVALID UTF-8 - Will cause UB!
  Error: Failed to parse 𝟘 into a valid locale.
```
A sound fix would be to accept only ASCII digits, for example by replacing `c.is_numeric()` with `c.is_ascii_digit()`.
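
As a rough illustration of that suggestion, the sketch below collects only ASCII digits and uses the checked `std::str::from_utf8` instead of the unchecked variant. The function `ascii_digits`, its signature, and the `Option` return type are our assumptions for a self-contained example; the real code returns the crate's own error type and does more work around this loop.

```rust
/// Hypothetical helper (not part of num-format): copies only the ASCII
/// digits of `s` into `buf` and returns them as a `&str`.
fn ascii_digits<'a>(s: &str, buf: &'a mut [u8]) -> Option<&'a str> {
    let mut index = 0;
    for c in s.chars() {
        // `is_ascii_digit` accepts only '0'..='9', so the cast below is lossless.
        if c.is_ascii_digit() {
            // Reject input that has more digits than the buffer can hold.
            if index >= buf.len() {
                return None;
            }
            buf[index] = c as u8;
            index += 1;
        }
    }
    if index == 0 {
        return None;
    }
    // Checked conversion: no `unsafe` is needed.
    std::str::from_utf8(&buf[..index]).ok()
}

fn main() {
    let mut buf = [0u8; 16];
    println!("{:?}", ascii_digits("1,234", &mut buf));
    println!("{:?}", ascii_digits("𝟘①½", &mut buf));
}
```

This sketch silently skips every character that is not an ASCII digit. Whether non-ASCII numeric characters should instead be rejected with a parse error is a design decision for the maintainers.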

Thanks for reading. Let us know if you have any questions about this report.

Excerpt of the referenced lines (`parsing.rs`, lines 94-107):

            if c.is_numeric() {
                if index > BUF_LEN {
                    return Err(Error::parse_number(&s));
                }
                buf[index] = c as u8;
                index += 1;
            }
        }

        if index == 0 {
            return Err(Error::parse_number(&s));
        }

        let s2 = unsafe { str::from_utf8_unchecked(&buf[..index]) };

Undefined Behavior in Unicode Numeric Parsing `impl_from_formatted_str` #52
