fix(source-marketo): fix CSV column misalignment with CJK characters (AI-Triage PR)#74088
Draft
devin-ai-integration[bot] wants to merge 2 commits intomasterfrom
Draft
fix(source-marketo): fix CSV column misalignment with CJK characters (AI-Triage PR)#74088devin-ai-integration[bot] wants to merge 2 commits intomasterfrom
devin-ai-integration[bot] wants to merge 2 commits intomasterfrom
Conversation
…ine separator splitting
The CSV parser in source-marketo uses requests.iter_lines() which
internally calls str.splitlines(). Python's splitlines() splits on
Unicode line separators (U+2028, U+2029) in addition to \n and \r\n.
When CJK text fields contain these Unicode characters, splitlines()
incorrectly splits a single CSV row into multiple lines, causing
column misalignment. This results in the primary key 'id' field
being parsed as null, which triggers BasicAirbyteMessageValidator
to abort the sync.
Fix: Pass delimiter='\n' to iter_lines() so it uses str.split('\n')
instead of str.splitlines(), preserving Unicode line separators in
field values. Also strip \r to handle \r\n line endings.
This is the same class of bug fixed in #3327 for
Java destinations.
Resolves airbytehq/oncall#11468
Co-Authored-By: bot_apk <apk@cognition.ai>
Contributor
Author
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
Contributor
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksPR Slash CommandsAirbyte Maintainers (that's you!) can execute the following slash commands on your PR:
📚 Show Repo GuidanceHelpful Resources
|
Co-Authored-By: bot_apk <apk@cognition.ai>
Contributor
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Fixes CSV column misalignment in the Marketo connector when syncing Leads containing CJK (Chinese, Japanese, Korean) characters. The misalignment causes the primary key
idto be parsed asnull, which triggersBasicAirbyteMessageValidatorto abort the sync.Resolves https://github.com/airbytehq/oncall/issues/11468:
Related OSS issue: #74087
How
requests.Response.iter_lines()internally uses Python'sstr.splitlines()when nodelimiteris specified.splitlines()splits on Unicode line separators (U+2028, U+2029) in addition to\n/\r\n. When CJK text fields contain these characters, a single CSV row is incorrectly split into multiple lines, misaligning all subsequent columns.The fix passes
delimiter="\n"toiter_lines()so it usesstr.split("\n")instead ofstr.splitlines(), preserving Unicode line separators within field values. An additionalrstrip("\r")handles\r\nline endings.This is the same class of bug previously fixed for Java destinations in #3327.
Review guide
source_marketo/source.py— the core fix (3 lines changed inparse_response). Key things to verify:if linefilter drops empty strings produced bysplit("\n"). The previoussplitlines()behavior also dropped trailing empty lines. Check whether dropping empty lines mid-stream could ever be problematic.rstrip("\r")strips trailing carriage returns. Check whether this could strip legitimate\rfrom the last field in a CSV row (extremely unlikely, but worth considering).unit_tests/test_source.py— updated existing mock fromsplitlines()tosplit("\n")to match new behavior, and added a new testtest_parse_response_with_unicode_line_separatorthat directly validates U+2028 in a field value doesn't cause misalignment.User Impact
Marketo connector syncs with CJK data that previously failed with null primary key errors should now succeed. No configuration changes required.
Can this PR be safely reverted and rolled back?
Link to Devin run: https://app.devin.ai/sessions/65603bd6dc3540e9b16e6be6c4c379c2
Requested by: bot_apk (apk@cognition.ai)