# onETL

## What is onETL?

Python ETL/ELT library powered by Apache Spark and other open-source tools.
## Goals

- Provide unified classes to extract data from (E) and load data to (L) various stores.
- Provide the Spark DataFrame API for performing transformations (T) in terms of ETL.
- Provide direct access to databases, allowing you to execute SQL queries as well as DDL and DML statements, and to call functions/procedures. This can be used for building ELT pipelines.
- Support different read strategies for incremental and batch data fetching.
- Provide a hooks & plugins mechanism for altering the behavior of internal classes.
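The incremental read strategies mentioned above can be illustrated conceptually. This is not onETL's actual API (onETL wraps the idea in strategy classes and persists the state between runs); it is a minimal plain-Python sketch of the mechanism, and every name in it is made up:

```python
# Minimal sketch of an incremental read strategy: remember the highest
# value of a monotonically increasing column (the "high watermark", HWM)
# and, on the next run, fetch only rows above it.
# All names here are illustrative; this is NOT onETL's API.

class HighWatermarkStrategy:
    def __init__(self):
        self.hwm = None  # a real implementation persists this between runs

    def filter_increment(self, rows, column):
        """Return only rows newer than the stored HWM, then advance it."""
        if self.hwm is None:
            increment = list(rows)  # first run: everything is new
        else:
            increment = [row for row in rows if row[column] > self.hwm]
        if increment:
            self.hwm = max(row[column] for row in increment)
        return increment


strategy = HighWatermarkStrategy()

# First run: all rows are fetched, HWM advances to 3.
batch1 = strategy.filter_increment([{"id": 1}, {"id": 2}, {"id": 3}], column="id")

# Second run: only rows with id > 3 are fetched, HWM advances to 5.
batch2 = strategy.filter_increment(
    [{"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}], column="id",
)

print(len(batch1), len(batch2), strategy.hwm)  # 3 2 5
```

In onETL the same idea is applied to Spark reads, so repeated runs of a pipeline only process rows that appeared since the previous run.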
## Non-goals

- onETL is not a Spark replacement. It only provides functionality that Spark lacks, and improves the UX for end users.
- onETL is not a framework: it imposes no requirements on project structure, naming, the way of running ETL/ELT processes, configuration, and so on. All of that should be implemented in some other tool.
- onETL is deliberately developed without any integration with scheduling software like Apache Airflow. All such integrations should be implemented as separate tools.
- Only batch operations, no streaming. For streaming, prefer Apache Flink.
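The ELT approach described under Goals — load raw data first, then transform it with SQL inside the database — can be sketched with Python's built-in `sqlite3` module. onETL applies the same idea to real warehouses by letting you execute SQL/DDL/DML directly; SQLite and all table/column names here are just illustrative stand-ins:

```python
import sqlite3

# ELT sketch: land raw data first (L), then transform inside the
# database with SQL (T). Names are made up for illustration.

conn = sqlite3.connect(":memory:")

# L: load raw events as-is, without transforming them in flight.
conn.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# T: transform inside the database using plain SQL (DDL + DML).
conn.execute(
    """
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(amount) AS total
    FROM raw_events
    GROUP BY user_id
    """
)

totals = dict(conn.execute("SELECT user_id, total FROM user_totals"))
print(totals)  # {1: 15.0, 2: 7.5}
```

The transformation runs where the data lives, which is the point of ELT: the database engine, not the pipeline code, does the heavy lifting.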
## Requirements

- Python 3.7 – 3.12
- PySpark 2.3.x – 3.5.x (depends on the connector used)
- Java 8+ (required by Spark, see below)
- Kerberos libs & GCC (required by the Hive, HDFS and SparkHDFS connectors)
## Supported storages

| Type | Storage | Powered by |
|---|---|---|
| Database | Clickhouse | Apache Spark JDBC Data Source |
| | MSSQL | Apache Spark JDBC Data Source |
| | MySQL | Apache Spark JDBC Data Source |
| | Postgres | Apache Spark JDBC Data Source |
| | Oracle | Apache Spark JDBC Data Source |
| | Teradata | Apache Spark JDBC Data Source |
| | Hive | Apache Spark Hive integration |
| | Kafka | Apache Spark Kafka integration |
| | Greenplum | VMware Greenplum Spark connector |
| | MongoDB | MongoDB Spark connector |
| File | HDFS | HDFS Python client |
| | S3 | minio client |
| | SFTP | Paramiko library |
| | FTP | FTPUtil library |
| | FTPS | FTPUtil library |
| | WebDAV | WebdavClient3 library |
| | Samba | pysmb library |
| Files as DataFrame | SparkLocalFS | Apache Spark File Data Source |
| | SparkHDFS | Apache Spark File Data Source |
| | SparkS3 | Hadoop AWS library |