Skip to content

fix: Match Spark for historical Asia/Shanghai DST in unix_timestamp#516

Open
zhangxffff wants to merge 2 commits intobytedance:mainfrom
zhangxffff:main
Open

fix: Match Spark for historical Asia/Shanghai DST in unix_timestamp#516
zhangxffff wants to merge 2 commits intobytedance:mainfrom
zhangxffff:main

Conversation

@zhangxffff
Copy link
Copy Markdown
Collaborator

@zhangxffff zhangxffff commented Apr 17, 2026

What problem does this PR solve?

For Asia/Shanghai session timezone, unix_timestamp / get_timestamp
(and anything routing through UnixTimestampFunctionBase) had two bugs:

  1. Failed on DST-gap local times. '19490501' parses as
    1949-05-01 00:00 Shanghai, which sits in the DST-start gap (clocks
    jumped 00:00 CST → 01:00 CDT). date::time_zone::to_sys rejected
    this as nonexistent_local_time and Bolt surfaced it as
    INVALID_ARGUMENT, aborting the task. Spark
    silently forward-shifts through the gap (ZonedDateTime.ofLocal).
  2. Silent 1-hour divergence during historical DST periods. The
    Shanghai fast-path used STDOFF-only arithmetic, so every moment
    inside the 1919, 1940–1949 (wartime), and 1986–1991 DST windows
    returned naive − 8h instead of Spark's naive − 9h.

Root cause for (1): an ordering bug in the two
getResultInGMT / getTimestampResultInGMT overloads — the
sessionTimeZone_ != nullptr branch ran before isShanghai, so
Shanghai never actually reached calculateCnUnixTimestamp.
Root cause for (2): calculateCnUnixTimestamp only routed through
tzdata for the 1986-1991 window, everything else fell back to
naive − sessionTzOffsetInSeconds_.

Issue Number: #515

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

Scope of the change is intentionally kept to sparksql only —
Timestamp::toGMT is not modified, so Presto / cast / low-level
callers keep strict "throw on gap" semantics.

  • calculateCnUnixTimestamp now routes through tzdata for every
    Shanghai regime (LMT, CST, and all historical DST windows). Before
    calling toGMT, local seconds are passed through
    TimeZone::correct_nonexistent_time, which forward-shifts gap
    local times to the post-gap offset — matching JVM's
    ZonedDateTime.ofLocal gap branch exactly.
  • The cold sessionTzID_-only path (reachable when setSpecTimezone
    fell through to setTimezone) now resolves the zone via
    tz::locateZone(id) and applies the same gap correction, so it no
    longer throws on Shanghai gap midnights.
  • getResultInGMT(DateTimeResultValue&) and
    getTimestampResultInGMT(DateTimeResultValue&): reordered so
    isShanghai is checked before sessionTimeZone_. This routes
    Shanghai sessions through calculateCnUnixTimestamp as intended.

Why no global fix to Timestamp::toGMT? The method is also used by
Presto functions and cast expressions, which currently rely on
strict-fail-on-gap behavior. A narrower fix inside sparksql avoids
any blast radius to those paths.

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).
  • Positive Impact: I have run benchmarks.
  • Negative Impact: Explained below (e.g., trade-off for correctness).

One extra correct_nonexistent_time call per row in the Shanghai
path (a single get_info lookup that was already happening inside
to_sys). Cold-path adds a tz::locateZone lookup, but that path
is only reached in edge cases where setSpecTimezone failed.

Release Note

Release Note:

Release Note:
- Fixed a crash in `unix_timestamp` / `get_timestamp` when parsing Asia/Shanghai DST-gap local times (e.g. `'19490501 00:00'`).
- Fixed `unix_timestamp` returning a value 1 hour off from Spark for moments inside historical Asia/Shanghai DST periods (1919, 1940-1949, 1986-1991).

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Testing:

  • Added unixTimestampShanghaiDstGap with 17 assertions covering
    DST-start midnights (Shang rule + PRC rule), interior DST
    moments, and non-DST baselines — all values confirmed against
    Spark.
  • Added a disabled (DISABLED_) exhaustive test
    unixTimestampShanghaiAgainstSpark that cross-checks every hour
    from 1900-01-01 00:00 to 2026-01-01 23:00 (1,104,528 rows)
    against a TSV produced by Spark. The regen recipe (Spark-SQL
    query + invocation) is embedded in the test's doc comment.
  • Local verification run: 1,104,528 rows, 0 mismatches, ~245 ms.

Breaking Changes

  • No
  • Yes

@zhangxffff zhangxffff changed the title [fix] Match Spark for historical Asia/Shanghai DST in unix_timestamp fix: Match Spark for historical Asia/Shanghai DST in unix_timestamp Apr 17, 2026
Previously unix_timestamp / get_timestamp threw VeloxUserError on
Asia/Shanghai DST-gap local times (e.g. '19490501 00:00' sits in the
00:00 CST -> 01:00 CDT gap) and silently diverged from Spark by 1h for
any moment inside a historical DST period (1919, 1940s wartime,
1986-1991). Root causes: the Shanghai fast-path used STDOFF-only
arithmetic, and an ordering bug meant Shanghai never reached that path
when sessionTimeZone_ was set.

- Widen calculateCnUnixTimestamp to always route through tzdata and
  pre-correct nonexistent local times via correct_nonexistent_time,
  so gaps forward-shift like JVM ZonedDateTime.ofLocal.
- Resolve the zone pointer from sessionTzID_ in the cold fallback
  path so it also gets gap correction instead of throwing.
- Reorder the isShanghai branch before sessionTimeZone_ in the two
  getResultInGMT / getTimestampResultInGMT overloads so Shanghai
  actually reaches the fast-path.
- Change is scoped to sparksql: Timestamp::toGMT is unchanged, so
  Presto / cast / low-level callers keep strict gap semantics.

Adds unixTimestampShanghaiDstGap covering gap-start midnights
(Shang and PRC eras), interior DST moments, and non-DST baselines.
Adds a disabled exhaustive test that cross-checks every hour from
1900 to 2026 against Spark (recipe in the test comment); local run
showed 1,104,528 rows with 0 mismatches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -139,16 +139,28 @@ struct UnixTimestampFunctionBase {

// calculate china/shanghai unix-timestamp
Timestamp calculateCnUnixTimestamp(Timestamp& timestamp) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个改动有做过 benchmark 吗?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

没有,我再测试一下

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants