libcpp causes an XID_Start code point detection error #4510

@NerdTook

libcpp causes an XID_Start code point detection error [bug]


  1. ARABIC-INDIC DIGIT FOUR (U+0664) can be used as the first character of an identifier

// test.rs
fn main() {
    let ٤ = 42;   // U+0664 ARABIC-INDIC DIGIT FOUR and it should not be a start identifier
    println!("{}", ٤);
}
nerdtook@PC:~/Temporary/nccp$ gcc test.rs -frust-incomplete-and-experimental-compiler-do-not-use
test.rs:3:9: warning: unused name ‘٤’ [-Wunused-variable]
    3 |     let ٤ = 42;   // U+0664 ARABIC-INDIC DIGIT FOUR

Instead of rejecting the program, gccrs accepts ٤ as a valid identifier.


  2. The faulty code is in libcpp/character.cc:

/* Returns flags representing the XID properties of the given codepoint.  */
unsigned int
cpp_check_xid_property (cppchar_t c)
{
  ...
  if (flags & CXX23)
    return CPP_XID_START | CPP_XID_CONTINUE;
  if (flags & NXX23)
    return CPP_XID_CONTINUE;
  return 0;
}

  3. Swap these two if-statements to fix it; otherwise the second case is never reached.

-  if (flags & CXX23)
-    return CPP_XID_START | CPP_XID_CONTINUE;
   if (flags & NXX23)
     return CPP_XID_CONTINUE;
+  if (flags & CXX23)
+    return CPP_XID_START | CPP_XID_CONTINUE;

  4. Why does this happen?

In libcpp/makeucnid.cc:

static void
read_derivedcore (char *fname)
{
  ...
      if (strncmp (l, "XID_Start ", 10) == 0)
        {
          for (; codepoint_start <= codepoint_end; codepoint_start++)
            flags[codepoint_start]
              = (flags[codepoint_start] | CXX23) & ~NXX23;
        }
      else if (strncmp (l, "XID_Continue ", 13) == 0)
        {
          for (; codepoint_start <= codepoint_end; codepoint_start++)
            if ((flags[codepoint_start] & CXX23) == 0)
              flags[codepoint_start] |= CXX23 | NXX23;
        }
  ...
}

This is the only place where the CXX23 and NXX23 flags are set, which means that for any code point, depending on its derived core properties:

XID_Start         -> CXX23 = 1, NXX23 = 0
XID_Continue only -> CXX23 = 1, NXX23 = 1
neither           -> CXX23 = 0, NXX23 = 0

The combination CXX23 = 0, NXX23 = 1 never occurs, so if the XID_Start/XID_Continue property is tested with:

  if (flags & CXX23)
    return CPP_XID_START | CPP_XID_CONTINUE;
  if (flags & NXX23)
    return CPP_XID_CONTINUE;

a character that is only XID_Continue is mistakenly reported as CPP_XID_START as well, because the first test always matches. Therefore

           let ٤ = 42;   // U+0664 ARABIC-INDIC DIGIT FOUR and it should not be a start identifier

is accepted, with ٤ treated as a valid identifier.

static const struct ucnrange ucnranges[] = {
...
{   0|  0|  0|C11|  0|CXX23|NXX23|CID|NFC|NKC|  0, 220, 0x065f },
{ C99|N99|  0|C11|  0|CXX23|NXX23|CID|NFC|NKC|  0,   0, 0x0669 },  // <--- U+0664 belongs to the range [0x0660, 0x0669]; both CXX23 and NXX23 are set, and NXX23 means it must not start an identifier.
...
}

I am also trying to report this to GCC, but I am still waiting for my account request to be processed.


  5. Scope

This affects only rust-lex.cc. Other parts of GCC do not depend on the function "cpp_check_xid_property"; instead, they test the NXX23 bit directly after making sure that CXX23 is set.
