[ASTERIXDB-2877] Fix multi-byte character handling in CSV output by ongdisheng · Pull Request #37 · apache/asterixdb

ongdisheng · 2026-03-03T14:00:58Z

Description

There are two bugs in writeUTF8StringAsCSV in PrintTools.java:

Incorrect loop step in the quoting scan:
The loop that checks whether a string needs quoting advanced one byte at a time (i++). Because a multi-byte character spans 2 to 4 bytes, charAt() could be called at an offset that points into the middle of a character.
Incorrect character writing:
Characters were written using PrintStream.print(char), which encodes the char through the platform default charset before writing. For multi-byte characters such as é and CJK characters (e.g. 中), this re-encoding could produce incorrect bytes when the platform default charset is not UTF-8. Writing the raw UTF-8 bytes directly from the array skips the re-encoding, so the output no longer depends on the platform.
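The boundary problem in the first bug can be illustrated with a minimal, self-contained sketch. It does not use the AsterixDB classes: Utf8ScanSketch, charSize, misalignedOffsets and needsQuoting are hypothetical names, the size lookup assumes standard UTF-8 lead bytes, and the set of characters that trigger quoting (comma, double quote, CR, LF) is an assumption rather than the actual rule in PrintTools.java.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for illustration; not the AsterixDB implementation.
final class Utf8ScanSketch {

    // Size in bytes of the UTF-8 sequence that starts with the given lead byte.
    // Throws if the byte is a continuation byte, i.e. not a character boundary.
    static int charSize(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        throw new IllegalArgumentException("offset is not at a character boundary");
    }

    // Counts the offsets a byte-at-a-time loop (i++) would visit
    // that fall in the middle of a multi-byte character.
    static int misalignedOffsets(byte[] utf8) {
        int bad = 0;
        for (int i = 0; i < utf8.length; i++) {
            if ((utf8[i] & 0xC0) == 0x80) {
                bad++; // continuation byte: decoding a char here is invalid
            }
        }
        return bad;
    }

    // Boundary-aware scan: advances by the size of the current character,
    // so every inspected offset is the first byte of a character.
    // Assumption: comma, double quote, CR and LF are the quoting triggers.
    static boolean needsQuoting(byte[] utf8) {
        for (int i = 0; i < utf8.length; i += charSize(utf8[i])) {
            byte b = utf8[i];
            if (b == ',' || b == '"' || b == '\n' || b == '\r') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] mixed = "\u00e9,\u4e2d".getBytes(StandardCharsets.UTF_8); // "é,中"
        System.out.println(misalignedOffsets(mixed));
        System.out.println(needsQuoting(mixed));
    }
}
```

For the six bytes of "é,中", a byte-at-a-time loop would land on three offsets (1, 4 and 5) that sit inside a character, while the boundary-aware scan only inspects offsets 0, 2 and 3 and finds the comma at offset 2.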

Fix

The quoting scan loop now advances by UTF8StringUtil.charSize() per iteration, so charAt() is always called at a valid character boundary. Characters are now written as raw UTF-8 bytes, which is consistent with how writeUTF8StringAsJSON already handles the same data.
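The difference between the two write paths can be reproduced outside AsterixDB with a small sketch. CsvWriteSketch, viaPrintChar and viaRawBytes are hypothetical names, and the stream charset is passed explicitly (using the PrintStream constructor that takes a Charset, available since Java 10) to stand in for a non-UTF-8 platform default; this is an illustration of the effect, not the code in PrintTools.java.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for illustration; not the AsterixDB implementation.
final class CsvWriteSketch {

    // Old path: each char is re-encoded by the PrintStream's charset.
    // streamCharset simulates the platform default charset.
    static byte[] viaPrintChar(String s, Charset streamCharset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream ps = new PrintStream(sink, true, streamCharset);
        for (int i = 0; i < s.length(); i++) {
            ps.print(s.charAt(i));
        }
        ps.flush();
        return sink.toByteArray();
    }

    // New path: the already-encoded UTF-8 bytes are copied through unchanged,
    // so the stream's charset never gets involved.
    static byte[] viaRawBytes(byte[] utf8, Charset streamCharset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream ps = new PrintStream(sink, true, streamCharset);
        ps.write(utf8, 0, utf8.length);
        ps.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u4e2d"; // "中"
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(viaPrintChar(text, StandardCharsets.ISO_8859_1)));
        System.out.println(Arrays.toString(viaRawBytes(utf8, StandardCharsets.ISO_8859_1)));
    }
}
```

With an ISO-8859-1 stream, the print(char) path cannot represent 中 and emits a single replacement byte, whereas the raw-byte path emits the original three UTF-8 bytes regardless of the stream's charset.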

How to Reproduce and Verify

Setup

disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode 'statement=
DROP DATAVERSE test IF EXISTS;
CREATE DATAVERSE test;
USE test;
CREATE TYPE TweetType AS { id: int, text: string };
CREATE DATASET tweets(TweetType) PRIMARY KEY id;
' "http://localhost:19002/query/service"

disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode 'statement=
USE test;
INSERT INTO tweets ({"id": 1, "text": "@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋"});
' "http://localhost:19002/query/service"

Before fix

disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" "http://localhost:19002/query/service"
{
        "requestID": "3056c5df-fb14-4ff8-90b6-dcc62662a563",
        "signature": {
                "*": "*"
        },
        "results": [ {"text":"@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋"} ]
        ,
        "plans":{},
        "status": "success",
        "metrics": {
                "elapsedTime": "28.199702ms",
                "executionTime": "26.761674ms",
                "compileTime": "10.872533ms",
                "queueWaitTime": "0ns",
                "resultCount": 1,
                "resultSize": 111,
                "processedObjects": 1
        }
}
disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" "http://localhost:19002/query/service?format=csv&header=absent"
{
        "requestID": "6bbf0b3d-05a3-43e5-ad61-d3ebeee20e5e",
        "type": "text/csv; header=absent",
        "signature": {
                "*": "*"
        },
        "errors": [{ 
                "code": 1,              "msg": "java.lang.IllegalArgumentException" } 
        ],
        "status": "fatal",
        "metrics": {
                "elapsedTime": "26.795447ms",
                "executionTime": "25.588135ms",
                "compileTime": "11.320646ms",
                "queueWaitTime": "0ns",
                "resultCount": 0,
                "resultSize": 0,
                "processedObjects": 0,
                "bufferCacheHitRatio": "0.00%",
                "bufferCachePageReadCount": 0,
                "errorCount": 1
        }
}

After fix

disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" \
     "http://localhost:19002/query/service"
{
        "requestID": "6f404f34-1726-42d7-ba7a-990d9b08cd0d",
        "signature": {
                "*": "*"
        },
        "results": [ {"text":"@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋"} ]
        ,
        "plans":{},
        "status": "success",
        "metrics": {
                "elapsedTime": "138.660107ms",
                "executionTime": "134.309116ms",
                "compileTime": "43.763632ms",
                "queueWaitTime": "0ns",
                "resultCount": 1,
                "resultSize": 111,
                "processedObjects": 1
        }
}
disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" \
     "http://localhost:19002/query/service?format=csv&header=absent"
{
        "requestID": "855ebc1a-6816-49ef-bbcb-7e99fe6e5ef0",
        "type": "text/csv; header=absent",
        "signature": {
                "*": "*"
        },
        "results": [ "@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋" ]
        ,
        "plans":{},
        "status": "success",
        "metrics": {
                "elapsedTime": "34.516791ms",
                "executionTime": "32.945963ms",
                "compileTime": "11.91599ms",
                "queueWaitTime": "1ms",
                "resultCount": 1,
                "resultSize": 102,
                "processedObjects": 1
        }
}

JIRA Issue

https://issues.apache.org/jira/browse/ASTERIXDB-2877

ongdisheng added 2 commits March 1, 2026 12:38

[ASTERIXDB-2877][CSV] Fix multi-byte/emoji character corruption in CS…

d6d7ced

…V output

expand PrintToolsTest coverage

2b3ba8b

ongdisheng wants to merge 2 commits into apache:master from ongdisheng:ASTERIXDB-2877