Harden validation, fix correctness issues, improve test coverage#6
Open
rhaist wants to merge 1 commit intojdemaeyer:masterfrom
Open
Harden validation, fix correctness issues, improve test coverage#6rhaist wants to merge 1 commit intojdemaeyer:masterfrom
rhaist wants to merge 1 commit intojdemaeyer:masterfrom
Conversation
Security & robustness: - Add 10s timeout to fetch() to prevent indefinite hangs - Replace all assert statements that validate external data with proper ValueError/RuntimeError exceptions (asserts are stripped under python -O) Correctness: - api.parse()/parse_url(): raise ValueError for unrecognized filenames instead of calling None() (TypeError) - MOSMIXParser.sanitize_records: use 'is not None' checks instead of truthiness to correctly handle zero values (0.0 precipitation, 0 cloud cover, etc.) - MOSMIXParser: fix wind direction correction to use modulo (% 360) instead of subtraction (- 360), which failed for values > 720 - PressureObservationsParser: use 'is None' instead of 'not' for pressure_msl check to avoid treating 0 as missing - CAPParser.sanitize_event: handle Z suffix in fromisoformat() for Python 3.9-3.10 compat (same workaround MOSMIX already had) - Document sentinel pattern in units._find() threshold mappings Performance: - Replace re.split(r'\s+', ...) with str.split() in MOSMIX value parsing (~50% of parse time is spent in this loop) Code quality: - Remove broken data_length property from RadarParser (references undefined BYTES_PER_PIXEL attribute from RADOLANParser) - Deduplicate _is_tag() into base Parser class as a static method - Fix dead reference in benchmark.py (update -> load) Packaging: - Parse __version__ from file instead of importing dwdparse at setup.py build time (avoids import-time side effects) - Bump python_requires to >=3.9 to match CI test matrix - Update ruff target-version to py39 accordingly Tests (14 new): - get_parser() with unknown filename returns None - api.parse() raises ValueError for unknown files - MOSMIXParser.sanitize_records: zero values, negative values, None values, wind direction modulo, cloud cover overflow - CurrentObservationsParser.sanitize_record: boundary and overflow - PressureObservationsParser: barometric approximation trigger
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Apology and context
First off — sorry for the size of this PR. I sat down to do a thorough code review of dwdparse (we use it downstream via Bright Sky) and ended up finding a cluster of small-but-real issues that are easier to reason about as a single coherent diff than as 15 separate PRs. Every change is explained below so you can evaluate each one independently. Nothing here changes the public API contract or output format.
I verified all changes against brightsky to ensure nothing breaks downstream (Bright Sky subclasses most parsers, uses
dwdparse.utils.fetch, and monkey-patches_converterin tests — all still work).Changes by category
1. Security & robustness
utils.py—fetch()now has a 10s default timeouturllib.request.urlopen()without a timeout will block indefinitely if a server hangs mid-response. Addedtimeout=10as a keyword argument so callers can override if needed. Bright Sky's usage (station list download) is well within 10s.parsers.py—assert→ proper exceptions for external data validationThere were ~15
assertstatements validating data from DWD files (product type, grid dimensions, byte counts, XML structure). Running Python with-Ostrips all assertions, silently turning these into no-ops. Replaced each withValueError(bad data) orRuntimeError(architecture mismatch), with descriptive error messages that include the expected vs actual values. This affectsMOSMIXParser._parse_stream,MOSMIXParser.parse_station,ObservationsParser.parse_records,RADOLANParser.parse_header,RADOLANParser.parse_data, andRadarParser.verify_meta.2. Correctness
api.py—parse()/parse_url()raiseValueErrorfor unrecognized filenamesPreviously, if
get_parser()returnedNone(no matching parser), both functions would callNone()→TypeError: 'NoneType' is not callable. Now they raise a clearValueErrorwith the filename.get_parser()itself still returnsNonefor no match (intentional — tested and used by Bright Sky).parsers.py—MOSMIXParser.sanitize_recordstruthiness bugThe sanitization checks used
if r['precipitation']which isFalsefor0.0. This meant zero precipitation, zero wind speed, zero cloud cover all skipped validation. Changed all guards tois not None. The existing behavior happened to be correct by coincidence (0 is always valid), but the intent was wrong and would mask bugs if DWD ever produced e.g.cloud_cover = 0.0alongside an impossible value in another field.parsers.py— Wind direction uses% 360instead of- 360The old fix
r['wind_direction'] -= 360only corrects values in the 360–720 range. A value of 725° would become 365° (still invalid). Changed to% 360which handles any value.parsers.py—PressureObservationsParserusesis Noneinstead ofnotif not elements['pressure_msl']treats 0 as missing, triggering the barometric approximation for a valid zero reading. Changed tois None. (Atmospheric pressure is never 0 Pa in practice, but the semantic was wrong.)parsers.py—CAPParser.sanitize_eventZ-suffix infromisoformat()datetime.fromisoformat()doesn't acceptZas timezone on Python 3.9–3.10 (fixed in 3.11). The MOSMIX parser already had there.sub(r'Z$', '+00:00', ...)workaround — applied the same pattern to CAP timestamp parsing for consistency across the supported Python range.units.py— Document sentinel pattern in_find()Added a docstring explaining that each condition mapping must end with a sentinel entry (value
None) to avoid the last real entry "leaking" for codes above the highest key. This is a subtle invariant that's easy to break when adding new entries.3. Performance
parsers.py—str.split()instead ofre.split(r'\s+', ...)in MOSMIXThe comment says ~50% of parse time is spent in the value-splitting loop.
str.split()(no args) splits on any whitespace and strips leading/trailing — identical behavior tore.split(r'\s+', s.strip())but avoids regex overhead. This is a free micro-optimization in a hot loop.4. Code quality
parsers.py— Remove brokendata_lengthproperty fromRadarParserRadarParser.data_lengthreferencesself.BYTES_PER_PIXELwhich is defined onRADOLANParser, notRadarParser(they don't share an inheritance chain). Calling it would raiseAttributeError. The property is unused —RadarParserreads HDF5 files which are self-describing, so no byte-count check is needed.parsers.py— Deduplicate_is_tag()into baseParserclassMOSMIXParser._is_tag(element, tag, ns)andCAPParser._is_tag(element, tag)were near-identical (only differing in whethernswas a parameter orself.ns). Moved the 3-arg version toParseras a@staticmethod. UpdatedCAPParser.parse_event()to passself.nsexplicitly.scripts/benchmark.py— Fix dead method referenceStationIDConverter.updatedoesn't exist — the method is calledload. The benchmark was silently assigning a lambda to a nonexistent attribute.5. Packaging
setup.py— Parse version from file instead of importingimport dwdparseat build time executes all module-level code (including importing parsers, units, stations). This can fail in clean environments where optional deps aren't installed yet, or cause unexpected side effects. Now uses a regex to extract__version__from__init__.py.setup.py+pyproject.toml— Bump minimum Python to 3.9python_requireswas>=3.8but CI only tests 3.9–3.13. The walrus operator (:=) used throughout requires 3.8+ as a floor, but 3.8 reached EOL in October 2024 and is untested. Aligned the declared minimum with the CI matrix. Updated rufftarget-versionto match.6. Tests (14 new)
test_get_parser_unknown_filenameget_parser()returnsNonefor unrecognized filestest_parse_raises_for_unknown_fileapi.parse()raisesValueErrortest_mosmix_sanitize_zero_valuestest_mosmix_sanitize_negative_valuestest_mosmix_sanitize_wind_direction_modulotest_mosmix_sanitize_cloud_cover_overflowtest_mosmix_sanitize_none_valuesNonefields don't trigger sanitizationtest_current_observations_sanitize_boundary_valuestest_current_observations_sanitize_overflowtest_pressure_observations_approximationNone, not zeroTest plan
ruff check .— no issuespytest— 40/40 pass (26 existing + 14 new)fetch()usage, and station converter patching