Avro#

class onetl.file.format.avro.Avro(*, avroSchema: Dict | None = None, avroSchemaUrl: str | None = None, **kwargs)#

Avro file format.

Based on Spark Avro file format.

Supports reading/writing files with .avro extension.

Version compatibility
  • Spark versions: 2.4.x - 3.5.x

  • Java versions: 8 - 20

  • Scala versions: 2.11 - 2.13

See the Spark Avro documentation linked above for details.

Note

You can pass any option to the constructor, even if it is not mentioned in this documentation. Option names should be in camelCase!

The set of supported options depends on the Spark version; see the link above.
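To illustrate the camelCase requirement (this is a generic sketch, not onetl's actual internals): extra keyword arguments are effectively pass-through options for Spark's reader/writer, so their names must already match Spark's option names exactly.

```python
# Illustrative sketch only: shows the camelCase naming convention for options
# forwarded to Spark; this is not onetl's real implementation.

def collect_spark_options(**kwargs):
    """Gather pass-through options; keys must match Spark's camelCase names."""
    return dict(kwargs)

# "compression" and "positionalFieldMatching" are real Spark Avro option names.
# A snake_case variant like "positional_field_matching" would not be recognized
# by Spark and would be silently ignored.
options = collect_spark_options(compression="snappy", positionalFieldMatching=True)
print(sorted(options))
```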

Examples

Describe options how to read from/write to Avro file with specific options:

from onetl.file.format import Avro
from pyspark.sql import SparkSession

# Create Spark session with Avro package loaded
maven_packages = Avro.get_packages(spark_version="3.5.0")
spark = (
    SparkSession.builder.appName("spark-app-name")
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)

# Describe file format
schema = {
    "type": "record",
    "name": "Person",
    "fields": [{"name": "name", "type": "string"}],
}
avro = Avro(schema_dict=schema, compression="snappy")
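Spark's Avro reader ultimately receives the schema as a JSON string (via the `avroSchema` option), so the schema dict must serialize cleanly to JSON. A quick sanity check, independent of onetl:

```python
import json

# The same Avro record schema as in the example above, as a plain dict.
schema = {
    "type": "record",
    "name": "Person",
    "fields": [{"name": "name", "type": "string"}],
}

# Serialize to the JSON form Spark's "avroSchema" option expects,
# then round-trip to confirm nothing is lost.
schema_json = json.dumps(schema)
assert json.loads(schema_json) == schema
print(schema_json)
```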

classmethod get_packages(spark_version: str | Version, scala_version: str | Version | None = None) → list[str]#

Get package names to be downloaded by Spark.

See Maven package index for all available packages.

Parameters:

spark_version : str
    Spark version in format major.minor.patch.

scala_version : str, optional
    Scala version in format major.minor.

    If None, spark_version is used to determine Scala version.

Examples

from onetl.file.format import Avro

Avro.get_packages(spark_version="3.2.4")
Avro.get_packages(spark_version="3.2.4", scala_version="2.13")
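The returned package names follow the standard Maven coordinate pattern for spark-avro, `org.apache.spark:spark-avro_<scala>:<spark>`. A minimal sketch of that mapping (the Scala-version defaulting below is a simplification for illustration, not onetl's actual resolution logic):

```python
from typing import Optional


def spark_avro_package(spark_version: str, scala_version: Optional[str] = None) -> str:
    """Build the Maven coordinate for the spark-avro package (simplified sketch)."""
    if scala_version is None:
        # Simplified stand-in: Spark 3.x defaults to Scala 2.12,
        # Spark 2.4.x to Scala 2.11. Real resolution is more nuanced.
        scala_version = "2.12" if spark_version.startswith("3") else "2.11"
    return f"org.apache.spark:spark-avro_{scala_version}:{spark_version}"


print(spark_avro_package("3.2.4"))          # org.apache.spark:spark-avro_2.12:3.2.4
print(spark_avro_package("3.2.4", "2.13"))  # org.apache.spark:spark-avro_2.13:3.2.4
```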