Add Spark 4.0 support via deequ:2.0.14-spark-4.0 #259

Open

m-aciek wants to merge 1 commit into awslabs:master from m-aciek:spark-4-support

Conversation

@m-aciek m-aciek commented Mar 26, 2026

Closes #258

Summary

  • Add "4.0": "com.amazon.deequ:deequ:2.0.14-spark-4.0" to SPARK_TO_DEEQU_COORD_MAPPING in configs.py
  • Widen pyspark optional dep from >=2.4.7,<3.4.0 to >=2.4.7,<5.0.0 in pyproject.toml
  • Replace scala.collection.JavaConversions (removed in Scala 2.13) with JavaConverters in scala_utils.py and profiles.py
  • Replace scala.collection.Seq.empty() (inaccessible via Py4J in Scala 2.13) with an empty Java list converted via to_scala_seq in analyzers.py and checks.py
  • Add Spark 4.0.0 to the CI matrix with Java 17; restructure matrix to use include: style so each Spark version carries its required Java version
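As a sketch, the mapping change amounts to the following. Only the `"4.0"` entry is what this PR adds; the 3.x entries here are illustrative stand-ins for the existing mapping and may not match the repository exactly:

```python
# Sketch of SPARK_TO_DEEQU_COORD_MAPPING in pydeequ's configs.py.
# The "4.0" line is the new entry; the 3.x lines are illustrative.
SPARK_TO_DEEQU_COORD_MAPPING = {
    "3.3": "com.amazon.deequ:deequ:2.0.7-spark-3.3",
    "3.5": "com.amazon.deequ:deequ:2.0.7-spark-3.5",
    "4.0": "com.amazon.deequ:deequ:2.0.14-spark-4.0",  # added by this PR
}
```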

Root causes fixed

Spark 4 uses Scala 2.13, which introduced two breaking changes affecting pydeequ:

  1. scala.collection.JavaConversions was removed — replaced by JavaConverters with explicit .asScala()/.asJava() calls
  2. scala.collection.Seq.empty() is not accessible via Py4J reflection — replaced with to_scala_seq(jvm, jvm.java.util.ArrayList()) which constructs an empty Scala Seq via the already-fixed converter
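A minimal sketch of the converter-based approach described above. The helper name `to_scala_seq` comes from the PR description; `jvm` is assumed to be a Py4J `JVMView` from an active SparkSession (e.g. `spark.sparkContext._jvm`), and the exact body may differ from what landed in scala_utils.py:

```python
def to_scala_seq(jvm, java_iterable):
    """Convert a Java iterable to a Scala Seq via JavaConverters.

    Works under both Scala 2.12 and 2.13, unlike the removed
    scala.collection.JavaConversions or Seq.empty() Py4J reflection.
    """
    return (
        jvm.scala.collection.JavaConverters
        .asScalaIteratorConverter(java_iterable.iterator())
        .asScala()
        .toSeq()
    )

# Instead of jvm.scala.collection.Seq.empty() (fails under Scala 2.13),
# build an empty Java list and convert it:
#   empty_seq = to_scala_seq(jvm, jvm.java.util.ArrayList())
```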

Test plan

  • All 99 existing tests pass with SPARK_VERSION=4.0.0 / pyspark==4.0.0
  • CI matrix extended to cover Spark 4.0.0 with Java 17
  • Existing Spark 3.x matrix entries unchanged
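For illustration, the `include:`-style pairing of Spark and Java versions in a GitHub Actions matrix looks roughly like this; the Spark 3.x entries and key names are assumptions, not copied from the workflow file:

```yaml
# Sketch of a GitHub Actions strategy where each Spark version carries
# its required Java version; exact 3.x versions are illustrative.
strategy:
  matrix:
    include:
      - spark-version: "3.3.0"
        java-version: "8"
      - spark-version: "3.5.0"
        java-version: "11"
      - spark-version: "4.0.0"
        java-version: "17"  # Spark 4.0 requires Java 17+
```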

PR authored with assistance from Claude Code

Author

m-aciek commented Apr 10, 2026

This is now ready for review; CI tests pass on my fork: https://github.com/m-aciek/python-deequ/actions/runs/24196839467

Contributor

@chenliu0831 chenliu0831 left a comment


LGTM. I'm not sure if we would like to keep maintaining the Py4J approach, though.

@chenliu0831
Contributor

@m-aciek we need your commit to have verified signatures

- Add "4.0" entry to SPARK_TO_DEEQU_COORD_MAPPING in configs.py
- Widen pyspark optional dep bound to <5.0.0 in pyproject.toml
- Replace scala.collection.JavaConversions (removed in Scala 2.13) with
  JavaConverters in scala_utils.py and profiles.py
- Replace scala.collection.Seq.empty() (inaccessible via Py4J in Scala 2.13)
  with to_scala_seq(jvm, jvm.java.util.ArrayList()) in analyzers.py and checks.py
- Add Spark 4.0.0 to CI matrix with Java 17; use include: style to pair
  each Spark version with its required Java version
- Fix CI for Spark 4.0:
  - use Python 3.9 and version-marker pyspark dep
  - use pip install instead of poetry add
  - install pandas>=2.0.0 required by PySpark 4.0
- Fix empty Seq compatibility across Scala 2.12 and 2.13

Fixes awslabs#258
Author

m-aciek commented Apr 22, 2026

@chenliu0831 I've set up the verification and squashed the commits.
