Spark#

All DB connection classes (Clickhouse, Greenplum, Hive and others) and all FileDF connection classes (SparkHDFS, SparkLocalFS, SparkS3) require Spark to be installed.

Installing Java#

First, install a JDK. The exact installation instructions depend on your OS; here are some examples:

yum install java-1.8.0-openjdk-devel  # CentOS 7 + Spark 2
dnf install java-11-openjdk-devel  # CentOS 8 + Spark 3
apt-get install openjdk-11-jdk  # Debian-based + Spark 3
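
To verify the installation, check which Java version is visible in your environment. A minimal sketch in Python (assuming the java binary is on PATH):

import subprocess

# prints the installed Java version (java -version writes to stderr)
subprocess.run(["java", "-version"], check=True)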

Compatibility matrix#

Spark    Python        Java          Scala
2.3.x    3.7 only      8 only        2.11
2.4.x    3.7 only      8 only        2.11
3.2.x    3.7 - 3.10    8u201 - 11    2.12
3.3.x    3.7 - 3.10    8u201 - 17    2.12
3.4.x    3.7 - 3.12    8u362 - 20    2.12
3.5.x    3.8 - 3.12    8u371 - 20    2.12

Installing PySpark#

Then install PySpark by passing the spark extra to pip:

pip install onetl[spark]  # install latest PySpark

or install PySpark explicitly:

pip install onetl pyspark==3.5.0  # install a specific PySpark version

or inject PySpark into sys.path in some other way BEFORE creating a connection class instance. Otherwise the connection object cannot be created.
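
For example, one third-party way to inject an existing Spark distribution into sys.path is the findspark package. A minimal sketch (assuming findspark is installed and SPARK_HOME points to your Spark distribution), which also prints the versions to compare against the compatibility matrix above:

import sys

import findspark  # third-party helper, assumed to be installed

findspark.init()  # prepends the PySpark bundled with $SPARK_HOME to sys.path

import pyspark

print("Python :", sys.version.split()[0])
print("PySpark:", pyspark.__version__)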

Injecting Java packages#

Some DB and FileDF connection classes require specific packages, like JDBC drivers, to be added to the CLASSPATH of the Spark session.

This is usually done by setting the spark.jars.packages option while creating the Spark session:

from pyspark.sql import SparkSession

from onetl.connection import Greenplum, MySQL, Teradata

# here is a list of Maven packages to be downloaded:
maven_packages = (
    Greenplum.get_packages(spark_version="3.2")
    + MySQL.get_packages()
    + Teradata.get_packages()
)

spark = (
    SparkSession.builder.config("spark.app.name", "onetl")
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)

Spark automatically resolves the package and all its dependencies, downloads them and injects them into the Spark session (both the driver and all executors).

This requires internet access, because package metadata and .jar files are fetched from Maven Repository.

But sometimes you need to:

  • Install a package without direct internet access (isolated network)

  • Install a package which is not available in Maven

There are several ways to do that.

Using spark.jars#

The simplest solution, but it requires storing raw .jar files somewhere on the filesystem or a web server.

  • Download the package .jar file (it’s usually named something like some-package_1.0.0.jar); see the download sketch after the code block below. The local file name does not matter, but it should be unique.

  • (For spark.submit.deployMode=cluster) place the downloaded files in HDFS or on any HTTP web server serving static files. See the official documentation for more details.

  • Create the Spark session, passing the absolute .jar file paths to the spark.jars Spark config option:

from pyspark.sql import SparkSession

jar_files = ["/path/to/package.jar"]

# do not pass spark.jars.packages
spark = (
    SparkSession.builder.config("spark.app.name", "onetl")
    .config("spark.jars", ",".join(jar_files))
    .getOrCreate()
)
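
Downloading the .jar file itself can be done with any HTTP client. A minimal sketch using the standard library (the Maven Central path below is a placeholder, substitute the real group/artifact/version of your package):

import urllib.request

# hypothetical coordinates, replace with the actual path of your package
jar_url = "https://repo1.maven.org/maven2/com/example/some-package/1.0.0/some-package_1.0.0.jar"
urllib.request.urlretrieve(jar_url, "/path/to/package.jar")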

Using spark.jars.repositories#

Note

In this case Spark will still try to fetch packages from the internet, so if you don’t have internet access, the Spark session will be created with a significant delay while all those fetch attempts time out.

Can be used if you have access to both public repos (like Maven) and a private Artifactory/Nexus repo.

  • Set up a private Maven repository in JFrog Artifactory or Sonatype Nexus.

  • Download the package .jar file (it’s usually named something like some-package_1.0.0.jar). The local file name does not matter.

  • Upload the .jar file to the private repository (with the same groupId and artifactId as the source package in Maven).

  • Pass the repository URL to the spark.jars.repositories Spark config option.

  • Create the Spark session, passing the package names to the spark.jars.packages Spark config option:

maven_packages = (
    Greenplum.get_packages(spark_version="3.2")
    + MySQL.get_packages()
    + Teradata.get_packages()
)

spark = (
    SparkSession.builder.config("spark.app.name", "onetl")
    .config("spark.jars.repositories", "http://nexus.mydomain.com/private-repo/")
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)

Using spark.jars.ivySettings#

Same as above, but can be used even if there is no network access to public repos like Maven.

  • Set up a private Maven repository in JFrog Artifactory or Sonatype Nexus.

  • Download the package .jar file (it’s usually named something like some-package_1.0.0.jar). The local file name does not matter.

  • Upload the .jar file to the private repository (with the same groupId and artifactId as the source package in Maven).

  • Create an ivysettings.xml file (see below).

  • Add a resolver with the repository URL (and credentials, if required) to this file.

  • Pass the ivysettings.xml absolute path to the spark.jars.ivySettings Spark config option.

  • Create the Spark session, passing the package names to the spark.jars.packages Spark config option:

ivysettings.xml#
<ivysettings>
    <settings defaultResolver="main"/>
    <resolvers>
        <chain name="main" returnFirst="true">
            <!-- Use Maven cache -->
            <ibiblio name="local-maven-cache" m2compatible="true" root="file://${user.home}/.m2/repository"/>
            <!-- Use ~/.ivy2/jars/*.jar files -->
            <ibiblio name="local-ivy2-cache" m2compatible="false" root="file://${user.home}/.ivy2/jars"/>
            <!-- Download all packages from own Nexus instance -->
            <ibiblio name="nexus-private" m2compatible="true" root="http://nexus.mydomain.com/private-repo/" />
        </chain>
    </resolvers>
</ivysettings>
script.py#
maven_packages = (
    Greenplum.get_packages(spark_version="3.2")
    + MySQL.get_packages()
    + Teradata.get_packages()
)

spark = (
    SparkSession.builder.config("spark.app.name", "onetl")
    .config("spark.jars.ivySettings", "/path/to/ivysettings.xml")
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)

Place .jar file to ~/.ivy2/jars/#

Can be used to pass an already downloaded file to Ivy, and skip resolving the package from Maven.

  • Download the package .jar file (it’s usually named something like some-package_1.0.0.jar). The local file name does not matter, but it should be unique.

  • Move it to the ~/.ivy2/jars/ folder (see the copy sketch after the code block below).

  • Create the Spark session, passing the package names to the spark.jars.packages Spark config option:

maven_packages = (
    Greenplum.get_packages(spark_version="3.2")
    + MySQL.get_packages()
    + Teradata.get_packages()
)

spark = (
    SparkSession.builder.config("spark.app.name", "onetl")
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)
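
Moving the downloaded file into the Ivy jars folder can also be scripted. A minimal sketch:

import shutil
from pathlib import Path

ivy_jars = Path.home() / ".ivy2" / "jars"
ivy_jars.mkdir(parents=True, exist_ok=True)

# the file name only has to be unique within the folder
shutil.copy("/path/to/package.jar", ivy_jars / "some-package_1.0.0.jar")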

Place .jar file to Spark jars folder#

Note

The package file should be placed on all hosts/containers where Spark is running, both the driver and all executors.

Usually this is used only with either:
  • spark.master=local (driver and executors are running on the same host),

  • spark.master=k8s://... (.jar files are added to image or to volume mounted to all pods).

Can be used to embed .jar files into the default Spark classpath.

  • Download the package .jar file (it’s usually named something like some-package_1.0.0.jar). The local file name does not matter, but it should be unique.

  • Move it to the $SPARK_HOME/jars/ folder, e.g. ~/.local/lib/python3.7/site-packages/pyspark/jars/ or /opt/spark/3.2.3/jars/ (see the copy sketch after the code block below).

  • Create the Spark session WITHOUT passing package names to spark.jars.packages:

# no need to set spark.jars.packages or any other spark.jars.* option
# all jars already present in CLASSPATH, and loaded automatically

spark = SparkSession.builder.config("spark.app.name", "onetl").getOrCreate()
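
For a pip-installed PySpark, the jars folder can be located programmatically. A minimal sketch (assuming PySpark was installed as a Python package):

import os
import shutil

import pyspark

# for pip-installed PySpark the bundled jars live next to the package itself
spark_jars = os.path.join(os.path.dirname(pyspark.__file__), "jars")
shutil.copy("/path/to/package.jar", spark_jars)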

Manually adding .jar files to CLASSPATH#

Note

The package file should be placed on all hosts/containers where Spark is running, both the driver and all executors.

Usually this is used only with either:
  • spark.master=local (driver and executors are running on the same host),

  • spark.master=k8s://... (.jar files are added to image or to volume mounted to all pods).

Can be used to embed .jar files into the default Java classpath.

  • Download the package .jar file (it’s usually named something like some-package_1.0.0.jar). The local file name does not matter.

  • Set the environment variable CLASSPATH to /path/to/package.jar. You can set multiple file paths, separated by : on Linux or ; on Windows; see the alternative sketch after the code block below.

  • Create the Spark session WITHOUT passing package names to spark.jars.packages:

# no need to set spark.jars.packages or any other spark.jars.* option
# all jars already present in CLASSPATH, and loaded automatically

import os

from pyspark.sql import SparkSession

jar_files = ["/path/to/package.jar"]
# the classpath delimiter differs between Windows and Linux
delimiter = ";" if os.name == "nt" else ":"
spark = (
    SparkSession.builder.config("spark.app.name", "onetl")
    .config("spark.driver.extraClassPath", delimiter.join(jar_files))
    .config("spark.executor.extraClassPath", delimiter.join(jar_files))
    .getOrCreate()
)
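
As an alternative to the spark.*.extraClassPath options above, the CLASSPATH environment variable mentioned in the steps can be set from Python before the Spark JVM is started. A minimal sketch (the variable must be set before getOrCreate() is called):

import os

from pyspark.sql import SparkSession

# must be set before the JVM is launched by getOrCreate()
os.environ["CLASSPATH"] = "/path/to/package.jar"

spark = SparkSession.builder.config("spark.app.name", "onetl").getOrCreate()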