[ASTERIXDB-2877] Fix multi-byte character handling in CSV output #37

Open
ongdisheng wants to merge 2 commits into apache:master from ongdisheng:ASTERIXDB-2877

ongdisheng commented Mar 3, 2026

Description

There are two bugs in writeUTF8StringAsCSV in PrintTools.java:

  1. Incorrect loop step in the quoting scan:
    The loop that checks whether a string needs quoting advanced one byte at a time (i++). Because a multi-byte character spans 2 to 4 bytes, charAt() could be called at an offset in the middle of a character.

  2. Incorrect character writing:
    Characters were written using PrintStream.print(char), which encodes each char with the platform default charset before writing. For multi-byte characters such as é and CJK characters, this re-encoding can produce incorrect bytes when the platform default charset is not UTF-8. Writing the raw UTF-8 bytes directly from the array is correct regardless of platform.
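
For illustration, both problems can be reproduced with the plain JDK. This is only a sketch and does not use the Hyracks classes: MultiByteBugDemo, isContinuationByte, countMidCharacterOffsets and printCharByChar are hypothetical names, the input is standard UTF-8 without the length prefix used by the real string format, and ISO-8859-1 stands in for a non-UTF-8 platform default.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class MultiByteBugDemo {

    // Hypothetical helper: true if b is a UTF-8 continuation byte (10xxxxxx),
    // i.e. an offset that is NOT the start of a character.
    static boolean isContinuationByte(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Counts the offsets that a byte-by-byte scan (i++) visits in the middle
    // of a multi-byte character.
    static int countMidCharacterOffsets(byte[] utf8) {
        int count = 0;
        for (int i = 0; i < utf8.length; i++) {
            if (isContinuationByte(utf8[i])) {
                count++;
            }
        }
        return count;
    }

    // Writes text char by char through a PrintStream bound to charsetName,
    // imitating print(char) on a platform whose default charset is charsetName.
    static byte[] printCharByChar(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream ps;
        try {
            ps = new PrintStream(sink, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        for (int i = 0; i < text.length(); i++) {
            ps.print(text.charAt(i));
        }
        ps.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"; é is 2 bytes in UTF-8
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 bytes: " + utf8.length);
        System.out.println("mid-character offsets hit by i++: " + countMidCharacterOffsets(utf8));
        System.out.println("bytes after print(char) via ISO-8859-1: "
                + printCharByChar(text, "ISO-8859-1").length);
    }
}
```

The byte count after print(char) differs from the UTF-8 byte count, which shows that the output is no longer the original UTF-8 sequence.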

Fix

The quoting scan loop now advances by UTF8StringUtil.charSize() per iteration, so charAt() is always called at a valid character boundary. Characters are now written as raw UTF-8 bytes directly, which is consistent with how writeUTF8StringAsJSON already handles the same data.
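
A minimal sketch of the approach described above, not the actual patch. It assumes standard UTF-8 input without the length prefix of the real string format, and RFC 4180-style quoting (quote the field when it contains a double quote, comma, CR or LF, and double any embedded quotes). CsvUtf8Sketch, needsQuoting and toCsvField are hypothetical names, and charSize here is a local stand-in for UTF8StringUtil.charSize(), not the library method.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class CsvUtf8Sketch {

    // Local stand-in for UTF8StringUtil.charSize(): byte length of the UTF-8
    // sequence whose lead byte sits at offset s.
    static int charSize(byte[] b, int s) {
        int lead = b[s] & 0xFF;
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0b110) return 2;
        if ((lead >> 4) == 0b1110) return 3;
        if ((lead >> 3) == 0b11110) return 4;
        throw new IllegalArgumentException("not a lead byte at offset " + s);
    }

    // Scans character by character, so b[i] is always a lead byte.
    static boolean needsQuoting(byte[] b) {
        for (int i = 0; i < b.length; i += charSize(b, i)) {
            byte c = b[i];
            if (c == '"' || c == ',' || c == '\n' || c == '\r') {
                return true;
            }
        }
        return false;
    }

    // Copies each character's raw UTF-8 bytes to the output; no charset
    // conversion is involved, so the platform default charset is irrelevant.
    static byte[] toCsvField(byte[] utf8) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean quote = needsQuoting(utf8);
        if (quote) out.write('"');
        for (int i = 0; i < utf8.length; ) {
            int size = charSize(utf8, i);
            if (utf8[i] == '"') out.write('"'); // double an embedded quote
            out.write(utf8, i, size);
            i += size;
        }
        if (quote) out.write('"');
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "\uD83D\uDCAA" is the 4-byte character U+1F4AA used in the tweet below.
        byte[] field = "No more, \uD83D\uDCAA".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(toCsvField(field), StandardCharsets.UTF_8));
    }
}
```

Calling charSize at a continuation byte throws in this sketch, which is roughly the failure mode a byte-by-byte scan runs into.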

How to Reproduce and Verify

Setup
disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode 'statement=
DROP DATAVERSE test IF EXISTS;
CREATE DATAVERSE test;
USE test;
CREATE TYPE TweetType AS { id: int, text: string };
CREATE DATASET tweets(TweetType) PRIMARY KEY id;
' "http://localhost:19002/query/service"

disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode 'statement=
USE test;
INSERT INTO tweets ({"id": 1, "text": "@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋"});
' "http://localhost:19002/query/service"
Before fix
disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;"      "http://localhost:19002/query/service"
{
        "requestID": "3056c5df-fb14-4ff8-90b6-dcc62662a563",
        "signature": {
                "*": "*"
        },
        "results": [ {"text":"@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋"} ]
        ,
        "plans":{},
        "status": "success",
        "metrics": {
                "elapsedTime": "28.199702ms",
                "executionTime": "26.761674ms",
                "compileTime": "10.872533ms",
                "queueWaitTime": "0ns",
                "resultCount": 1,
                "resultSize": 111,
                "processedObjects": 1
        }
}
disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;"      "http://localhost:19002/query/service?format=csv&header=absent"
{
        "requestID": "6bbf0b3d-05a3-43e5-ad61-d3ebeee20e5e",
        "type": "text/csv; header=absent",
        "signature": {
                "*": "*"
        },
        "errors": [{ 
                "code": 1,              "msg": "java.lang.IllegalArgumentException" } 
        ],
        "status": "fatal",
        "metrics": {
                "elapsedTime": "26.795447ms",
                "executionTime": "25.588135ms",
                "compileTime": "11.320646ms",
                "queueWaitTime": "0ns",
                "resultCount": 0,
                "resultSize": 0,
                "processedObjects": 0,
                "bufferCacheHitRatio": "0.00%",
                "bufferCachePageReadCount": 0,
                "errorCount": 1
        }
}
After fix
disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" \
     "http://localhost:19002/query/service"
{
        "requestID": "6f404f34-1726-42d7-ba7a-990d9b08cd0d",
        "signature": {
                "*": "*"
        },
        "results": [ {"text":"@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋"} ]
        ,
        "plans":{},
        "status": "success",
        "metrics": {
                "elapsedTime": "138.660107ms",
                "executionTime": "134.309116ms",
                "compileTime": "43.763632ms",
                "queueWaitTime": "0ns",
                "resultCount": 1,
                "resultSize": 111,
                "processedObjects": 1
        }
}
disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" \
     "http://localhost:19002/query/service?format=csv&header=absent"
{
        "requestID": "855ebc1a-6816-49ef-bbcb-7e99fe6e5ef0",
        "type": "text/csv; header=absent",
        "signature": {
                "*": "*"
        },
        "results": [ "@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋" ]
        ,
        "plans":{},
        "status": "success",
        "metrics": {
                "elapsedTime": "34.516791ms",
                "executionTime": "32.945963ms",
                "compileTime": "11.91599ms",
                "queueWaitTime": "1ms",
                "resultCount": 1,
                "resultSize": 102,
                "processedObjects": 1
        }
}

JIRA Issue

https://issues.apache.org/jira/browse/ASTERIXDB-2877
