fix: Match Spark for historical Asia/Shanghai DST in unix_timestamp #516
Open
zhangxffff wants to merge 2 commits into bytedance:main
Conversation
Previously unix_timestamp / get_timestamp threw VeloxUserError on Asia/Shanghai DST-gap local times (e.g. '19490501 00:00' sits in the 00:00 CST -> 01:00 CDT gap) and silently diverged from Spark by 1h for any moment inside a historical DST period (1919, 1940s wartime, 1986-1991). Root causes: the Shanghai fast path used STDOFF-only arithmetic, and an ordering bug meant Shanghai never reached that path when sessionTimeZone_ was set.

- Widen calculateCnUnixTimestamp to always route through tzdata and pre-correct nonexistent local times via correct_nonexistent_time, so gaps forward-shift like JVM ZonedDateTime.ofLocal.
- Resolve the zone pointer from sessionTzID_ in the cold fallback path so it also gets gap correction instead of throwing.
- Reorder the isShanghai branch before sessionTimeZone_ in the two getResultInGMT / getTimestampResultInGMT overloads so Shanghai actually reaches the fast path.
- Change is scoped to sparksql: Timestamp::toGMT is unchanged, so Presto / cast / low-level callers keep strict gap semantics.

Adds unixTimestampShanghaiDstGap covering gap-start midnights (Shang and PRC eras), interior DST moments, and non-DST baselines. Adds a disabled exhaustive test that cross-checks every hour from 1900 to 2026 against Spark (recipe in the test comment); local run showed 1,104,528 rows with 0 mismatches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
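The gap semantics described above can be illustrated with Python's `zoneinfo` as a stand-in for the C++ tzdata path (an assumption of this sketch: the host tzdata carries the 1948-1949 Shang DST rules, as modern releases do). Under PEP 495, a nonexistent local time with `fold=0` resolves using the pre-gap offset, which denotes the same instant the JVM produces by forward-shifting the wall time past the gap:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

sha = ZoneInfo("Asia/Shanghai")

# 1949-05-01 00:00 never existed on the wall clock: it jumped
# 00:00 CST (+8h) -> 01:00 CDT (+9h).
# fold=0 applies the pre-gap offset (+8h) to the nonexistent time,
# which is the same instant as the forward-shifted 01:00 CDT.
gap_instant = datetime(1949, 5, 1, 0, 0, tzinfo=sha).timestamp()
shifted_instant = datetime(1949, 5, 1, 1, 0, tzinfo=sha).timestamp()
assert gap_instant == shifted_instant
```

Both expressions name the same UTC instant, which is why pre-correcting the gap before toGMT reproduces Spark's answer.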
Weixin-Xu reviewed Apr 20, 2026
@@ -139,16 +139,28 @@ struct UnixTimestampFunctionBase {
  // calculate china/shanghai unix-timestamp
  Timestamp calculateCnUnixTimestamp(Timestamp& timestamp) {
What problem does this PR solve?
For Asia/Shanghai session timezone, `unix_timestamp`/`get_timestamp` (and anything routing through `UnixTimestampFunctionBase`) had two bugs:

1. `'19490501'` parses as `1949-05-01 00:00` Shanghai, which sits in the DST-start gap (clocks jumped 00:00 CST → 01:00 CDT). `date::time_zone::to_sys` rejected this as `nonexistent_local_time` and Bolt surfaced it as `INVALID_ARGUMENT`, aborting the task. Spark silently forward-shifts through the gap (`ZonedDateTime.ofLocal`).
2. The Shanghai fast path used STDOFF-only arithmetic, so every moment inside the 1919, 1940–1949 (wartime), and 1986–1991 DST windows returned `naive − 8h` instead of Spark's `naive − 9h`.

Root cause for (1): an ordering bug in the two `getResultInGMT`/`getTimestampResultInGMT` overloads: the `sessionTimeZone_ != nullptr` branch ran before `isShanghai`, so Shanghai never actually reached `calculateCnUnixTimestamp`.

Root cause for (2): `calculateCnUnixTimestamp` only routed through tzdata for the 1986–1991 window; everything else fell back to `naive − sessionTzOffsetInSeconds_`.

Issue Number: #515
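The one-hour divergence in bug (2) can be reproduced with Python's `zoneinfo` standing in for tzdata (the specific date is this sketch's assumption of an interior 1986–1991 DST moment):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

sha = ZoneInfo("Asia/Shanghai")
cst = timezone(timedelta(hours=8))   # STDOFF-only arithmetic: fixed +8h

naive = datetime(1988, 7, 1, 12, 0)  # interior of the 1988 DST window (+9h)
stdoff_only = naive.replace(tzinfo=cst).timestamp()  # naive - 8h
tz_aware = naive.replace(tzinfo=sha).timestamp()     # naive - 9h (Spark)

# The fixed-offset result is exactly one hour later in real time
assert stdoff_only - tz_aware == 3600
```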
Type of Change
Description
Scope of the change is intentionally kept to sparksql only: `Timestamp::toGMT` is not modified, so Presto / cast / low-level callers keep strict "throw on gap" semantics.

- `calculateCnUnixTimestamp` now routes through tzdata for every Shanghai regime (LMT, CST, and all historical DST windows). Before calling `toGMT`, local seconds are passed through `TimeZone::correct_nonexistent_time`, which forward-shifts gap local times to the post-gap offset, matching JVM's `ZonedDateTime.ofLocal` gap branch exactly.
- The `sessionTzID_`-only path (reachable when `setSpecTimezone` fell through to `setTimezone`) now resolves the zone via `tz::locateZone(id)` and applies the same gap correction, so it no longer throws on Shanghai gap midnights.
- `getResultInGMT(DateTimeResultValue&)` and `getTimestampResultInGMT(DateTimeResultValue&)`: reordered so `isShanghai` is checked before `sessionTimeZone_`. This routes Shanghai sessions through `calculateCnUnixTimestamp` as intended.

Why no global fix to `Timestamp::toGMT`? The method is also used by Presto functions and cast expressions, which currently rely on strict fail-on-gap behavior. A narrower fix inside sparksql avoids any blast radius to those paths.
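The gap correction described above can be sketched in Python; `correct_nonexistent_time` here is a hypothetical analogue of the C++ helper, not its actual implementation. It probes both PEP 495 folds: in a spring-forward gap the post-transition offset (`fold=1`) exceeds the pre-transition one (`fold=0`), and the difference is the gap length:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def correct_nonexistent_time(naive: datetime, tz: ZoneInfo) -> datetime:
    # In a DST gap, fold=0 sees the pre-transition offset and fold=1
    # the post-transition offset, so off1 > off0 flags a nonexistent time.
    off0 = naive.replace(tzinfo=tz, fold=0).utcoffset()
    off1 = naive.replace(tzinfo=tz, fold=1).utcoffset()
    if off1 > off0:
        # Forward-shift by the gap length, like ZonedDateTime.ofLocal.
        return naive + (off1 - off0)
    return naive  # existing (or ambiguous) local time: unchanged

sha = ZoneInfo("Asia/Shanghai")
# Gap midnight is shifted forward one hour; an in-DST time is untouched.
assert correct_nonexistent_time(datetime(1949, 5, 1, 0, 0), sha) \
    == datetime(1949, 5, 1, 1, 0)
assert correct_nonexistent_time(datetime(1949, 6, 1, 0, 0), sha) \
    == datetime(1949, 6, 1, 0, 0)
```

Only after this normalization does the local time get converted to an epoch, which is why the gap midnight no longer throws.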
Performance Impact
One extra `correct_nonexistent_time` call per row in the Shanghai path (a single `get_info` lookup that was already happening inside `to_sys`). The cold path adds a `tz::locateZone` lookup, but that path is only reached in edge cases where `setSpecTimezone` failed.
Release Note
Release Note:
Checklist (For Author)
Testing:
- Added `unixTimestampShanghaiDstGap` with 17 assertions covering DST-start midnights (Shang rule + PRC rule), interior DST moments, and non-DST baselines; all values confirmed against Spark.
- Added a disabled (`DISABLED_`) exhaustive test `unixTimestampShanghaiAgainstSpark` that cross-checks every hour from `1900-01-01 00:00` to `2026-01-01 23:00` (1,104,528 rows) against a TSV produced by Spark. The regen recipe (Spark-SQL query + invocation) is embedded in the test's doc comment.
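The shape of that exhaustive check (one row per local hour) can be sketched in Python; the Spark TSV comparison is replaced here by a simple no-throw walk over one DST year, and the year chosen is this sketch's assumption:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

sha = ZoneInfo("Asia/Shanghai")

# Walk every hour of 1986 (a DST year). With gap-tolerant semantics,
# every local hour, including nonexistent ones, maps to some instant
# without raising.
t, end, n = datetime(1986, 1, 1), datetime(1987, 1, 1), 0
while t < end:
    t.replace(tzinfo=sha).timestamp()  # must not throw
    n += 1
    t += timedelta(hours=1)
assert n == 8760  # 365 days x 24 hours
```

The real test does the same walk over 1900–2026 and compares each epoch against the Spark-produced TSV.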
Breaking Changes