onETL


What is onETL?

onETL is a Python ETL/ELT library powered by Apache Spark and other open-source tools.

Goals

  • Provide unified classes to extract data from (E) & load data to (L) various stores.

  • Provide the Spark DataFrame API for performing transformations (T) in ETL terms.

  • Provide direct access to the database, allowing users to execute SQL queries as well as DDL and DML statements, and to call functions and procedures. This can be used to build ELT pipelines.

  • Support different read strategies for incremental and batch data fetching.

  • Provide a hooks and plugins mechanism for altering the behavior of internal classes.
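
The read-strategy goal can be illustrated without onETL itself. The following is a minimal sketch, using only Python's standard-library `sqlite3` module, of two ideas: an incremental read that remembers the highest `id` it has already seen (a high-water mark), and a read that walks the source in fixed-size `id` windows. It is not onETL's API. The `events` table, the function names `read_incremental` and `read_in_batches`, and the reading of "batch fetching" as windowed reads are all assumptions made for illustration.

```python
import sqlite3


def read_incremental(conn, hwm):
    """Fetch only rows whose id is above the stored high-water mark (hwm).

    Returns the new rows and the updated high-water mark.
    """
    rows = conn.execute(
        "SELECT id, value FROM events WHERE id > ? ORDER BY id", (hwm,)
    ).fetchall()
    new_hwm = rows[-1][0] if rows else hwm
    return rows, new_hwm


def read_in_batches(conn, start, stop, step):
    """Yield rows window by window: ids in (lo, lo + step], up to stop."""
    lo = start
    while lo < stop:
        hi = min(lo + step, stop)
        yield conn.execute(
            "SELECT id, value FROM events WHERE id > ? AND id <= ? ORDER BY id",
            (lo, hi),
        ).fetchall()
        lo = hi


# Hypothetical in-memory source table used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])

# First run: nothing has been read yet, so the high-water mark starts at 0.
first_run, hwm = read_incremental(conn, hwm=0)

# New data arrives; the second run picks up only what is above the mark.
conn.execute("INSERT INTO events VALUES (3, 'c')")
second_run, hwm = read_incremental(conn, hwm)

# Windowed read of the whole table, two ids per window.
batches = list(read_in_batches(conn, start=0, stop=3, step=2))
```

In this sketch the high-water mark lives in a local variable; a real pipeline would need to persist it between runs so that the next run can resume from it.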

Non-goals

  • onETL is not a Spark replacement. It just provides additional functionality that Spark does not have, and improves UX for end users.

  • onETL is not a framework: it imposes no requirements on project structure, naming, the way ETL/ELT processes are run, configuration, etc. All of that should be implemented in some other tool.

  • onETL is deliberately developed without any integration with scheduling software like Apache Airflow. All integrations should be implemented as separate tools.

  • onETL supports only batch operations, not streaming. For streaming, prefer Apache Flink.

Requirements

  • Python 3.7 - 3.12

  • PySpark 2.3.x - 3.5.x (depending on the connector used)

  • Java 8+ (required by Spark, see below)

  • Kerberos libs & GCC (required by Hive, HDFS and SparkHDFS connectors)
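
A script can verify the Python requirement up front, before importing heavier dependencies. This is a small sketch and not part of onETL: the helper name `python_supported` is hypothetical, and the bounds are copied from the list above.

```python
import sys

# Supported interpreter range, taken from the requirements list (inclusive).
MIN_PYTHON = (3, 7)
MAX_PYTHON = (3, 12)


def python_supported(version=None):
    """Return True if the (major, minor) version lies in the supported range.

    `version` is any tuple-like value such as sys.version_info; when omitted,
    the running interpreter's version is checked.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


if not python_supported():
    print(
        "Warning: Python %d.%d is outside the supported range 3.7 - 3.12"
        % tuple(sys.version_info[:2])
    )
```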

Supported storages

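Type                Storage       Powered by
------------------  ------------  --------------------------------
Database            Clickhouse    Apache Spark JDBC Data Source
Database            MSSQL         Apache Spark JDBC Data Source
Database            MySQL         Apache Spark JDBC Data Source
Database            Postgres      Apache Spark JDBC Data Source
Database            Oracle        Apache Spark JDBC Data Source
Database            Teradata      Apache Spark JDBC Data Source
Database            Hive          Apache Spark Hive integration
Database            Kafka         Apache Spark Kafka integration
Database            Greenplum     VMware Greenplum Spark connector
Database            MongoDB       MongoDB Spark connector
File                HDFS          HDFS Python client
File                S3            minio-py client
File                SFTP          Paramiko library
File                FTP           FTPUtil library
File                FTPS          FTPUtil library
File                WebDAV        WebdavClient3 library
File                Samba         pysmb library
Files as DataFrame  SparkLocalFS  Apache Spark File Data Source
Files as DataFrame  SparkHDFS     Apache Spark File Data Source
Files as DataFrame  SparkS3       Hadoop AWS library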