Harden validation, fix correctness issues, improve test coverage by rhaist · Pull Request #6 · jdemaeyer/dwdparse

rhaist · 2026-03-23T09:38:18Z

Apology and context

First off — sorry for the size of this PR. I sat down to do a thorough code review of dwdparse (we use it downstream via Bright Sky) and ended up finding a cluster of small-but-real issues that are easier to reason about as a single coherent diff than as 15 separate PRs. Every change is explained below so you can evaluate each one independently. Nothing here changes the public API contract or output format.

I verified all changes against brightsky to ensure nothing breaks downstream (Bright Sky subclasses most parsers, uses dwdparse.utils.fetch, and monkey-patches _converter in tests — all still work).

Changes by category

1. Security & robustness

utils.py — fetch() now has a 10s default timeout

urllib.request.urlopen() without a timeout will block indefinitely if a server hangs mid-response. Added timeout=10 as a keyword argument so callers can override if needed. Bright Sky's usage (station list download) is well within 10s.
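
A minimal sketch of the idea, not the actual dwdparse code: `fetch_with_timeout` is a hypothetical stand-in for `dwdparse.utils.fetch`, whose real signature may differ. It only shows how a default timeout can be passed through to `urlopen()` while staying overridable.

```python
import urllib.request


def fetch_with_timeout(url, timeout=10):
    """Return the response body for `url`, giving up after `timeout` seconds.

    Hypothetical helper for illustration; not the dwdparse implementation.
    """
    # Without an explicit timeout, urlopen() uses the global socket default,
    # which means "wait forever" unless socket.setdefaulttimeout() was called.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Callers that expect slow responses could pass a larger value, e.g. `fetch_with_timeout(url, timeout=60)`.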

parsers.py — assert → proper exceptions for external data validation

There were ~15 assert statements validating data from DWD files (product type, grid dimensions, byte counts, XML structure). Running Python with -O strips all assertions, silently turning these into no-ops. Replaced each with ValueError (bad data) or RuntimeError (architecture mismatch), with descriptive error messages that include the expected vs actual values. This affects MOSMIXParser._parse_stream, MOSMIXParser.parse_station, ObservationsParser.parse_records, RADOLANParser.parse_header, RADOLANParser.parse_data, and RadarParser.verify_meta.
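
The pattern could look roughly like the sketch below. `check_data_length` and its parameters are hypothetical names chosen for illustration; the real checks live inside the parser methods listed above and may be worded differently.

```python
def check_data_length(actual_bytes, width, height, bytes_per_pixel=2):
    """Validate a payload size against the grid dimensions.

    Hypothetical helper; names and the default of 2 bytes are assumptions.
    """
    expected = width * height * bytes_per_pixel
    # An `assert actual_bytes == expected` here would be removed by `python -O`,
    # so the check is written as an explicit comparison that always runs.
    if actual_bytes != expected:
        raise ValueError(
            f"Unexpected data length: expected {expected} bytes, "
            f"got {actual_bytes}"
        )
    return expected
```

Including both the expected and the actual value in the message makes a malformed file much easier to diagnose from a log line alone.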

2. Correctness

api.py — parse()/parse_url() raise ValueError for unrecognized filenames

Previously, if get_parser() returned None (no matching parser), both functions would call None() → TypeError: 'NoneType' object is not callable. Now they raise a clear ValueError with the filename. get_parser() itself still returns None for no match (intentional; tested and used by Bright Sky).
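
As a rough illustration of the control flow (the registry, the filename pattern, and `DummyParser` below are all made up for this sketch and do not mirror dwdparse's internals):

```python
import re


class DummyParser:
    """Hypothetical parser used only to make the sketch runnable."""

    def parse(self, path):
        return [{"source": path}]


# Hypothetical registry mapping filename patterns to parser classes.
PARSERS = {r"\.kmz$": DummyParser}


def get_parser(filename):
    """Return the matching parser class, or None if nothing matches."""
    for pattern, parser_cls in PARSERS.items():
        if re.search(pattern, filename):
            return parser_cls
    return None


def parse(path):
    """Parse `path`, failing loudly when no parser is registered for it."""
    parser_cls = get_parser(path)
    if parser_cls is None:
        # Without this guard the next line would call None() and raise a
        # TypeError that says nothing about the offending filename.
        raise ValueError(f"No parser found for file: {path!r}")
    return parser_cls().parse(path)
```

Keeping `get_parser()` lenient and putting the check in `parse()` lets callers that probe for a parser continue to test for `None`.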

parsers.py — MOSMIXParser.sanitize_records truthiness bug

The sanitization checks used guards like if r['precipitation'], which evaluate to False for 0.0. This meant zero precipitation, zero wind speed, and zero cloud cover all skipped validation. Changed all guards to is not None. The existing behavior happened to be correct by coincidence (0 is always valid), but the intent was wrong and would mask bugs if DWD ever produced e.g. cloud_cover = 0.0 alongside an impossible value in another field.
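
The difference between the two guard styles can be seen in a small sketch. `fields_checked` and the sample record are hypothetical and only report which fields a guard would let through to validation; they do not reproduce the actual sanitization rules.

```python
def fields_checked(record, use_truthiness):
    """Return the names of fields whose guard lets validation run."""
    checked = []
    for name, value in record.items():
        # Truthiness treats 0.0 the same as None; `is not None` does not.
        guard = bool(value) if use_truthiness else value is not None
        if guard:
            checked.append(name)
    return checked


# Hypothetical record with a zero value, a normal value, and a missing value.
record = {"precipitation": 0.0, "wind_speed": 3.2, "cloud_cover": None}

old_style = fields_checked(record, use_truthiness=True)
new_style = fields_checked(record, use_truthiness=False)
```

Here `old_style` contains only `wind_speed`, while `new_style` also includes `precipitation`; the missing `cloud_cover` is skipped in both cases.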

parsers.py — Wind direction uses % 360 instead of - 360

The old fix r['wind_direction'] -= 360 only corrects values in the 360–720 range. A value of 725° would become 365° (still invalid). Changed to % 360 which handles any value.
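
The arithmetic behind this, as a standalone sketch (the function names are illustrative and not taken from the parser):

```python
def subtract_once(direction):
    """Old-style correction: remove a single full turn."""
    return direction - 360


def wrap(direction):
    """Modulo correction: remove as many full turns as needed."""
    return direction % 360


# 725 degrees is two full turns plus 5 degrees.
after_subtraction = subtract_once(725)  # still above 360
after_modulo = wrap(725)                # back inside 0..359
```

Note that a plain `% 360` also maps exactly 360 to 0; whether the parser applies the modulo to every value or only to out-of-range ones is not shown here.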

parsers.py — PressureObservationsParser uses is None instead of not

if not elements['pressure_msl'] treats 0 as missing, triggering the barometric approximation for a valid zero reading. Changed to is None. (Atmospheric pressure is never 0 Pa in practice, but the semantics were wrong.)

parsers.py — CAPParser.sanitize_event Z-suffix in fromisoformat()

datetime.fromisoformat() doesn't accept Z as timezone on Python 3.9–3.10 (fixed in 3.11). The MOSMIX parser already had the re.sub(r'Z$', '+00:00', ...) workaround — applied the same pattern to CAP timestamp parsing for consistency across the supported Python range.
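
A minimal sketch of the workaround; `parse_timestamp` is a hypothetical wrapper name, while the `re.sub` call is the pattern quoted above.

```python
import re
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    # Python 3.9 and 3.10 reject the 'Z' suffix, so rewrite it to an
    # explicit UTC offset before handing the string to fromisoformat().
    return datetime.fromisoformat(re.sub(r"Z$", "+00:00", value))


parsed = parse_timestamp("2026-03-23T09:38:18Z")
```

Timestamps that already carry an explicit offset pass through unchanged, because the pattern only matches a `Z` at the very end of the string.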

units.py — Document sentinel pattern in _find()

Added a docstring explaining that each condition mapping must end with a sentinel entry (value None) to avoid the last real entry "leaking" for codes above the highest key. This is a subtle invariant that's easy to break when adding new entries.
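
To make the invariant concrete, here is a sketch under the assumption that the lookup returns the value of the largest key not exceeding the code. The `find` function, the codes, and the labels are all hypothetical; the real `_find()` and its mappings may differ.

```python
def find(mapping, code):
    """Return the value of the largest key <= code (assumed lookup rule)."""
    result = None
    for threshold, value in sorted(mapping.items()):
        if code < threshold:
            break
        result = value
    return result


# Hypothetical mappings; the final `80: None` entry is the sentinel.
WITH_SENTINEL = {0: "dry", 50: "rain", 70: "snow", 80: None}
WITHOUT_SENTINEL = {0: "dry", 50: "rain", 70: "snow"}

leaked = find(WITHOUT_SENTINEL, 95)   # last real entry leaks through
guarded = find(WITH_SENTINEL, 95)     # sentinel stops the leak
```

Without the sentinel, any code above the highest key silently inherits the last real value, which is exactly the failure mode the docstring warns about.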

3. Performance

parsers.py — str.split() instead of re.split(r'\s+', ...) in MOSMIX

The comment says ~50% of parse time is spent in the value-splitting loop. str.split() (no args) splits on any whitespace and ignores leading/trailing whitespace, so it behaves like re.split(r'\s+', s.strip()) for any non-blank input (a blank string yields [] rather than ['']) while avoiding regex overhead. This is a free micro-optimization in a hot loop.
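
A quick comparison of the two calls on a made-up line of values (the sample string is illustrative, not real MOSMIX data):

```python
import re

# Hypothetical whitespace-separated values with ragged spacing.
line = "  1.5   -   2.0\t3.25  \n"

via_regex = re.split(r"\s+", line.strip())
via_str = line.split()

# The one input where the two calls differ: a blank string.
blank_via_regex = re.split(r"\s+", "   ".strip())
blank_via_str = "   ".split()
```

Both calls produce the same four tokens for `line`; for the blank string the regex version returns a list with one empty string, while `str.split()` returns an empty list.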

4. Code quality

parsers.py — Remove broken data_length property from RadarParser

RadarParser.data_length references self.BYTES_PER_PIXEL which is defined on RADOLANParser, not RadarParser (they don't share an inheritance chain). Calling it would raise AttributeError. The property is unused — RadarParser reads HDF5 files which are self-describing, so no byte-count check is needed.

parsers.py — Deduplicate _is_tag() into base Parser class

MOSMIXParser._is_tag(element, tag, ns) and CAPParser._is_tag(element, tag) were near-identical (only differing in whether ns was a parameter or self.ns). Moved the 3-arg version to Parser as a @staticmethod. Updated CAPParser.parse_event() to pass self.ns explicitly.
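
A sketch of what the shared helper could look like. The body of `_is_tag`, the `CAPLikeParser` class, and the `urn:example:cap` namespace are assumptions made for illustration; only the three-argument static method on a base `Parser` class comes from the description above.

```python
import xml.etree.ElementTree as ET


class Parser:
    @staticmethod
    def _is_tag(element, tag, ns):
        """Return True if `element` is `tag` in namespace `ns` (assumed body)."""
        # ElementTree stores qualified names as "{namespace-uri}localname".
        return element.tag == f"{{{ns}}}{tag}"


class CAPLikeParser(Parser):
    ns = "urn:example:cap"  # placeholder namespace, not the real CAP URI

    def count_events(self, root):
        # The subclass now passes its own namespace explicitly.
        return sum(
            1 for el in root.iter() if self._is_tag(el, "event", self.ns)
        )


root = ET.fromstring(
    '<alert xmlns="urn:example:cap"><event/><event/><other/></alert>'
)
event_count = CAPLikeParser().count_events(root)
```

Because the helper is a static method, subclasses that override or monkey-patch other parser methods can keep calling it through `self` without changes.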

scripts/benchmark.py — Fix dead method reference

StationIDConverter.update doesn't exist — the method is called load. The benchmark was silently assigning a lambda to a nonexistent attribute.

5. Packaging

setup.py — Parse version from file instead of importing

import dwdparse at build time executes all module-level code (including importing parsers, units, stations). This can fail in clean environments where optional deps aren't installed yet, or cause unexpected side effects. Now uses a regex to extract __version__ from __init__.py.
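
One way such a regex-based lookup might be written; the exact pattern and the `read_version` name are assumptions, and the sketch takes the file contents as a string so it runs without touching the filesystem.

```python
import re


def read_version(source):
    """Extract the __version__ string from module source text."""
    match = re.search(
        r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", source, re.MULTILINE
    )
    if match is None:
        raise RuntimeError("Unable to find __version__ string")
    return match.group(1)


# Stand-in for the contents of the package's __init__.py.
sample = '"""Package docstring."""\n__version__ = "1.2.3"\n'
version = read_version(sample)
```

In a setup.py this would typically be fed the text of the package's `__init__.py`, so no module-level code is executed at build time.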

setup.py + pyproject.toml — Bump minimum Python to 3.9

python_requires was >=3.8 but CI only tests 3.9–3.13. The walrus operator (:=) used throughout requires 3.8+ as a floor, but 3.8 reached EOL in October 2024 and is untested. Aligned the declared minimum with the CI matrix. Updated ruff target-version to match.

6. Tests (14 new)

| Test | What it covers |
| --- | --- |
| `test_get_parser_unknown_filename` | `get_parser()` returns `None` for unrecognized files |
| `test_parse_raises_for_unknown_file` | `api.parse()` raises `ValueError` |
| `test_mosmix_sanitize_zero_values` | Zero values pass through sanitization unchanged |
| `test_mosmix_sanitize_negative_values` | Negative precip/wind/cloud are corrected |
| `test_mosmix_sanitize_wind_direction_modulo` | 725° → 5° via modulo |
| `test_mosmix_sanitize_cloud_cover_overflow` | 110% → clamped to 100% |
| `test_mosmix_sanitize_none_values` | `None` fields don't trigger sanitization |
| `test_current_observations_sanitize_boundary_values` | 100% cloud/humidity and 3600s sunshine are valid |
| `test_current_observations_sanitize_overflow` | 101%/3601s are rejected |
| `test_pressure_observations_approximation` | Barometric formula only triggers for `None`, not zero |

Test plan

ruff check . — no issues
pytest — 40/40 pass (26 existing + 14 new)
Verified no breaking changes against brightsky's parser subclasses, fetch() usage, and station converter patching

Security & robustness:
- Add 10s timeout to fetch() to prevent indefinite hangs
- Replace all assert statements that validate external data with proper ValueError/RuntimeError exceptions (asserts are stripped under python -O)

Correctness:
- api.parse()/parse_url(): raise ValueError for unrecognized filenames instead of calling None() (TypeError)
- MOSMIXParser.sanitize_records: use 'is not None' checks instead of truthiness to correctly handle zero values (0.0 precipitation, 0 cloud cover, etc.)
- MOSMIXParser: fix wind direction correction to use modulo (% 360) instead of subtraction (- 360), which failed for values > 720
- PressureObservationsParser: use 'is None' instead of 'not' for pressure_msl check to avoid treating 0 as missing
- CAPParser.sanitize_event: handle Z suffix in fromisoformat() for Python 3.9-3.10 compat (same workaround MOSMIX already had)
- Document sentinel pattern in units._find() threshold mappings

Performance:
- Replace re.split(r'\s+', ...) with str.split() in MOSMIX value parsing (~50% of parse time is spent in this loop)

Code quality:
- Remove broken data_length property from RadarParser (references undefined BYTES_PER_PIXEL attribute from RADOLANParser)
- Deduplicate _is_tag() into base Parser class as a static method
- Fix dead reference in benchmark.py (update -> load)

Packaging:
- Parse __version__ from file instead of importing dwdparse at setup.py build time (avoids import-time side effects)
- Bump python_requires to >=3.9 to match CI test matrix
- Update ruff target-version to py39 accordingly

Tests (14 new):
- get_parser() with unknown filename returns None
- api.parse() raises ValueError for unknown files
- MOSMIXParser.sanitize_records: zero values, negative values, None values, wind direction modulo, cloud cover overflow
- CurrentObservationsParser.sanitize_record: boundary and overflow
- PressureObservationsParser: barometric approximation trigger

rhaist wants to merge 1 commit into jdemaeyer:master from rhaist:fix/code-review-hardening
