Harden validation, fix correctness issues, improve test coverage #6

Open
rhaist wants to merge 1 commit into jdemaeyer:master from rhaist:fix/code-review-hardening

@rhaist rhaist commented Mar 23, 2026

Apology and context

First off — sorry for the size of this PR. I sat down to do a thorough code review of dwdparse (we use it downstream via Bright Sky) and ended up finding a cluster of small-but-real issues that are easier to reason about as a single coherent diff than as 15 separate PRs. Every change is explained below so you can evaluate each one independently. Nothing here changes the public API contract or output format.

I verified all changes against brightsky to ensure nothing breaks downstream (Bright Sky subclasses most parsers, uses dwdparse.utils.fetch, and monkey-patches _converter in tests — all still work).


Changes by category

1. Security & robustness

utils.py — fetch() now has a 10s default timeout

urllib.request.urlopen() without a timeout will block indefinitely if a server hangs mid-response. Added timeout=10 as a keyword argument so callers can override if needed. Bright Sky's usage (station list download) is well within 10s.
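
For illustration, a minimal sketch of the timeout change, assuming fetch() is a thin wrapper around urlopen(). The name fetch_with_timeout and its signature are hypothetical and not dwdparse's actual code:

```python
import urllib.request


def fetch_with_timeout(url, timeout=10):
    """Hypothetical stand-in for utils.fetch(): read a URL with a bounded wait."""
    # Without timeout=..., urlopen() can block indefinitely on a stalled server.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

A caller that expects a slow server can still opt into a longer wait, e.g. fetch_with_timeout(url, timeout=60).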

parsers.py — assert → proper exceptions for external data validation

There were ~15 assert statements validating data from DWD files (product type, grid dimensions, byte counts, XML structure). Running Python with -O strips all assertions, silently turning these into no-ops. Replaced each with ValueError (bad data) or RuntimeError (architecture mismatch), with descriptive error messages that include the expected vs actual values. This affects MOSMIXParser._parse_stream, MOSMIXParser.parse_station, ObservationsParser.parse_records, RADOLANParser.parse_header, RADOLANParser.parse_data, and RadarParser.verify_meta.
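
The pattern looks roughly like the sketch below. The function name check_grid and the 900x900 default are made up for the example; the actual checks and messages in parsers.py differ per call site:

```python
def check_grid(width, height, expected=(900, 900)):
    """Hypothetical validator: reject unexpected grid dimensions."""
    # An `assert (width, height) == expected` would vanish under `python -O`;
    # an explicit raise always runs.
    if (width, height) != expected:
        raise ValueError(
            f"Unexpected grid dimensions: expected {expected}, "
            f"got {(width, height)}"
        )
    return width, height
```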

2. Correctness

api.py — parse()/parse_url() raise ValueError for unrecognized filenames

Previously, if get_parser() returned None (no matching parser), both functions would call None() → TypeError: 'NoneType' object is not callable. Now they raise a clear ValueError that includes the filename. get_parser() itself still returns None for no match (this is intentional: the behavior is tested and used by Bright Sky).
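
A minimal sketch of the new control flow, assuming a pattern-based lookup. The PARSERS registry, the filename pattern, and the string returned by the factory are placeholders, not the real api.py internals:

```python
import re

# Hypothetical registry: (filename pattern, parser factory) pairs.
PARSERS = [
    (re.compile(r"MOSMIX_.*\.kmz$"), lambda: "mosmix-parser"),
]


def get_parser(filename):
    """Return a parser factory for filename, or None when nothing matches."""
    for pattern, factory in PARSERS:
        if pattern.search(filename):
            return factory
    return None


def parse(path):
    factory = get_parser(path)
    if factory is None:
        # Fail with a clear message instead of the opaque TypeError from None().
        raise ValueError(f"No parser found for file: {path!r}")
    return factory()
```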

parsers.py — MOSMIXParser.sanitize_records truthiness bug

The sanitization checks used if r['precipitation'], which is falsy for 0.0. This meant zero precipitation, zero wind speed, and zero cloud cover all skipped validation. Changed all guards to is not None. The existing behavior happened to be correct by coincidence (0 is always valid), but the intent was wrong and would mask bugs if DWD ever produced e.g. cloud_cover = 0.0 alongside an impossible value in another field.
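
The difference between the two guards, shown on a made-up record (needs_check is a hypothetical helper, not a function in parsers.py):

```python
def needs_check(value):
    """Guard used after the fix: validate anything that is actually present."""
    return value is not None


record = {"precipitation": 0.0, "wind_speed": None}

# Truthiness conflates "missing" with "zero"; the identity check does not.
print(bool(record["precipitation"]))         # False: 0.0 is falsy
print(needs_check(record["precipitation"]))  # True: 0.0 is a real reading
print(needs_check(record["wind_speed"]))     # False: None means missing
```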

parsers.py — Wind direction uses % 360 instead of - 360

The old fix r['wind_direction'] -= 360 only corrects values in the 360–720 range. A value of 725° would become 365° (still invalid). Changed to % 360 which handles any value.
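
A quick check of the arithmetic (normalize_direction is a hypothetical helper; the PR does not say under which condition the parser applies the correction):

```python
def normalize_direction(degrees):
    """Map any wind direction in degrees onto the range [0, 360)."""
    return degrees % 360


raw = 725
print(raw - 360)                 # 365: a single subtraction is still out of range
print(normalize_direction(raw))  # 5
```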

parsers.py — PressureObservationsParser uses is None instead of not

if not elements['pressure_msl'] treats 0 as missing, triggering the barometric approximation for a valid zero reading. Changed to is None. (Atmospheric pressure is never 0 Pa in practice, but the semantics were wrong.)

parsers.py — CAPParser.sanitize_event Z-suffix in fromisoformat()

datetime.fromisoformat() doesn't accept Z as timezone on Python 3.9–3.10 (fixed in 3.11). The MOSMIX parser already had the re.sub(r'Z$', '+00:00', ...) workaround — applied the same pattern to CAP timestamp parsing for consistency across the supported Python range.
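
The workaround in isolation (parse_iso is a hypothetical wrapper; only the re.sub pattern is taken from the description above):

```python
import re
from datetime import datetime, timezone


def parse_iso(timestamp):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' on Python < 3.11."""
    # Rewrite the 'Z' suffix into an explicit UTC offset before parsing.
    return datetime.fromisoformat(re.sub(r"Z$", "+00:00", timestamp))


print(parse_iso("2026-03-23T12:00:00Z") == datetime(2026, 3, 23, 12, tzinfo=timezone.utc))
```

Timestamps that already carry an explicit offset pass through the substitution unchanged.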

units.py — Document sentinel pattern in _find()

Added a docstring explaining that each condition mapping must end with a sentinel entry (value None) to avoid the last real entry "leaking" for codes above the highest key. This is a subtle invariant that's easy to break when adding new entries.
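
To show why the sentinel matters, here is a sketch that assumes _find() behaves like a "largest key not above the code" lookup. That lookup rule, the find() helper, and the threshold values are all assumptions made for this example, not the real units.py code:

```python
import bisect


def find(mapping, code):
    """Return the value of the largest key that is <= code (hypothetical lookup)."""
    keys = sorted(mapping)
    index = bisect.bisect_right(keys, code) - 1
    return mapping[keys[index]] if index >= 0 else None


# Made-up thresholds; the final entry is the sentinel that closes the last range.
WITH_SENTINEL = {0: "dry", 50: "rain", 70: "snow", 100: None}
WITHOUT_SENTINEL = {0: "dry", 50: "rain", 70: "snow"}

print(find(WITH_SENTINEL, 75))      # snow
print(find(WITH_SENTINEL, 120))     # None
print(find(WITHOUT_SENTINEL, 120))  # snow: the last real entry leaks
```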

3. Performance

parsers.py — str.split() instead of re.split(r'\s+', ...) in MOSMIX

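The comment says ~50% of parse time is spent in the value-splitting loop. str.split() (no args) splits on any run of whitespace and ignores leading/trailing whitespace. For any line that contains at least one value this is identical to re.split(r'\s+', s.strip()), but it avoids regex overhead. (The only difference is a blank line: str.split() returns [] where the regex version returns [''].) This is a free micro-optimization in a hot loop.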
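
The equivalence on a typical value line can be checked directly (the sample line is invented, not taken from a MOSMIX file):

```python
import re

line = "  1.00   2.50\t-     3.75 \n"

via_regex = re.split(r"\s+", line.strip())
via_str = line.split()

print(via_regex)  # ['1.00', '2.50', '-', '3.75']
print(via_str)    # ['1.00', '2.50', '-', '3.75']
```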

4. Code quality

parsers.py — Remove broken data_length property from RadarParser

RadarParser.data_length references self.BYTES_PER_PIXEL which is defined on RADOLANParser, not RadarParser (they don't share an inheritance chain). Calling it would raise AttributeError. The property is unused — RadarParser reads HDF5 files which are self-describing, so no byte-count check is needed.

parsers.py — Deduplicate _is_tag() into base Parser class

MOSMIXParser._is_tag(element, tag, ns) and CAPParser._is_tag(element, tag) were near-identical (only differing in whether ns was a parameter or self.ns). Moved the 3-arg version to Parser as a @staticmethod. Updated CAPParser.parse_event() to pass self.ns explicitly.
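
Structurally, the result looks something like the sketch below. It assumes ns is a plain namespace URI and that tags use ElementTree-style "{uri}name" qualified names; the is_info() method is invented for the example, and the real helper body may differ:

```python
import xml.etree.ElementTree as ET


class Parser:
    @staticmethod
    def _is_tag(element, tag, ns):
        # Assumes ElementTree-style qualified names: "{namespace-uri}local".
        return element.tag == f"{{{ns}}}{tag}"


class CAPParser(Parser):
    ns = "urn:oasis:names:tc:emergency:cap:1.2"

    def is_info(self, element):
        # Subclasses now pass their namespace explicitly.
        return self._is_tag(element, "info", self.ns)


element = ET.Element("{urn:oasis:names:tc:emergency:cap:1.2}info")
print(CAPParser().is_info(element))  # True
```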

scripts/benchmark.py — Fix dead method reference

StationIDConverter.update doesn't exist — the method is called load. The benchmark was silently assigning a lambda to a nonexistent attribute.

5. Packaging

setup.py — Parse version from file instead of importing

import dwdparse at build time executes all module-level code (including importing parsers, units, stations). This can fail in clean environments where optional deps aren't installed yet, or cause unexpected side effects. Now uses a regex to extract __version__ from __init__.py.
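
One common way to do this is sketched below. The read_version helper and the exact regex are illustrative and may not match what setup.py ends up using:

```python
import re


def read_version(source):
    """Extract __version__ from module source text without importing it."""
    match = re.search(
        r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", source, re.MULTILINE
    )
    if match is None:
        raise RuntimeError("Unable to find __version__ string")
    return match.group(1)
```

In setup.py this would be fed the text of the package's __init__.py, for example read_version(Path("dwdparse/__init__.py").read_text()).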

setup.py + pyproject.toml — Bump minimum Python to 3.9

python_requires was >=3.8 but CI only tests 3.9–3.13. The walrus operator (:=) used throughout requires 3.8+ as a floor, but 3.8 reached EOL in October 2024 and is untested. Aligned the declared minimum with the CI matrix. Updated ruff target-version to match.

6. Tests (14 new)

| Test | What it covers |
| --- | --- |
| test_get_parser_unknown_filename | get_parser() returns None for unrecognized files |
| test_parse_raises_for_unknown_file | api.parse() raises ValueError |
| test_mosmix_sanitize_zero_values | Zero values pass through sanitization unchanged |
| test_mosmix_sanitize_negative_values | Negative precip/wind/cloud are corrected |
| test_mosmix_sanitize_wind_direction_modulo | 725° → 5° via modulo |
| test_mosmix_sanitize_cloud_cover_overflow | 110% → clamped to 100% |
| test_mosmix_sanitize_none_values | None fields don't trigger sanitization |
| test_current_observations_sanitize_boundary_values | 100% cloud/humidity and 3600s sunshine are valid |
| test_current_observations_sanitize_overflow | 101%/3601s are rejected |
| test_pressure_observations_approximation | Barometric formula only triggers for None, not zero |

Test plan

  • ruff check . — no issues
  • pytest — 40/40 pass (26 existing + 14 new)
  • Verified no breaking changes against brightsky's parser subclasses, fetch() usage, and station converter patching

Security & robustness:
- Add 10s timeout to fetch() to prevent indefinite hangs
- Replace all assert statements that validate external data with
  proper ValueError/RuntimeError exceptions (asserts are stripped
  under python -O)

Correctness:
- api.parse()/parse_url(): raise ValueError for unrecognized filenames
  instead of calling None() (TypeError)
- MOSMIXParser.sanitize_records: use 'is not None' checks instead of
  truthiness to correctly handle zero values (0.0 precipitation, 0
  cloud cover, etc.)
- MOSMIXParser: fix wind direction correction to use modulo (% 360)
  instead of subtraction (- 360), which failed for values > 720
- PressureObservationsParser: use 'is None' instead of 'not' for
  pressure_msl check to avoid treating 0 as missing
- CAPParser.sanitize_event: handle Z suffix in fromisoformat() for
  Python 3.9-3.10 compat (same workaround MOSMIX already had)
- Document sentinel pattern in units._find() threshold mappings

Performance:
- Replace re.split(r'\s+', ...) with str.split() in MOSMIX value
  parsing (~50% of parse time is spent in this loop)

Code quality:
- Remove broken data_length property from RadarParser (references
  undefined BYTES_PER_PIXEL attribute from RADOLANParser)
- Deduplicate _is_tag() into base Parser class as a static method
- Fix dead reference in benchmark.py (update -> load)

Packaging:
- Parse __version__ from file instead of importing dwdparse at
  setup.py build time (avoids import-time side effects)
- Bump python_requires to >=3.9 to match CI test matrix
- Update ruff target-version to py39 accordingly

Tests (14 new):
- get_parser() with unknown filename returns None
- api.parse() raises ValueError for unknown files
- MOSMIXParser.sanitize_records: zero values, negative values,
  None values, wind direction modulo, cloud cover overflow
- CurrentObservationsParser.sanitize_record: boundary and overflow
- PressureObservationsParser: barometric approximation trigger