# onETL

## What is onETL?

Python ETL/ELT library powered by Apache Spark and other open-source tools.
## Goals

- Provide unified classes to extract data from (E) and load data to (L) various stores.
- Provide the Spark DataFrame API for performing transformations (T) in terms of ETL.
- Provide direct access to databases, allowing you to execute SQL queries as well as DDL and DML statements, and to call functions/procedures. This can be used for building ELT pipelines.
- Support different read strategies for incremental and batch data fetching.
- Provide a hooks & plugins mechanism for altering the behavior of internal classes.
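The incremental read strategies mentioned above can be illustrated conceptually. This is not onETL's actual API (onETL wraps the idea in strategy classes and persists the state between runs); it is a minimal plain-Python sketch of the mechanism, and every name in it is made up:

```python
# Minimal sketch of an incremental read strategy: remember the highest
# value of a monotonically increasing column (the "high watermark", HWM)
# and, on the next run, fetch only rows above it.
# All names here are illustrative; this is NOT onETL's API.

class HighWatermarkStrategy:
    def __init__(self):
        self.hwm = None  # a real implementation persists this between runs

    def filter_increment(self, rows, column):
        """Return only rows newer than the stored HWM, then advance it."""
        if self.hwm is None:
            increment = list(rows)  # first run: everything is new
        else:
            increment = [row for row in rows if row[column] > self.hwm]
        if increment:
            self.hwm = max(row[column] for row in increment)
        return increment


strategy = HighWatermarkStrategy()

# First run: all rows are fetched, HWM advances to 3.
batch1 = strategy.filter_increment([{"id": 1}, {"id": 2}, {"id": 3}], column="id")

# Second run: only rows with id > 3 are fetched, HWM advances to 5.
batch2 = strategy.filter_increment(
    [{"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}], column="id",
)

print(len(batch1), len(batch2), strategy.hwm)  # 3 2 5
```

In onETL the same idea is applied to Spark reads, so repeated runs of a pipeline only process rows that appeared since the previous run.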
## Non-goals

- onETL is not a Spark replacement. It only provides functionality that Spark lacks, and improves the UX for end users.
- onETL is not a framework: it imposes no requirements on project structure, naming, the way of running ETL/ELT processes, configuration, and so on. All of that should be implemented in some other tool.
- onETL is deliberately developed without any integration with scheduling software like Apache Airflow. All such integrations should be implemented as separate tools.
- Only batch operations, no streaming. For streaming, prefer Apache Flink.
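The ELT approach described under Goals — load raw data first, then transform it with SQL inside the database — can be sketched with Python's built-in `sqlite3` module. onETL applies the same idea to real warehouses by letting you execute SQL/DDL/DML directly; SQLite and all table/column names here are just illustrative stand-ins:

```python
import sqlite3

# ELT sketch: land raw data first (L), then transform inside the
# database with SQL (T). Names are made up for illustration.

conn = sqlite3.connect(":memory:")

# L: load raw events as-is, without transforming them in flight.
conn.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# T: transform inside the database using plain SQL (DDL + DML).
conn.execute(
    """
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(amount) AS total
    FROM raw_events
    GROUP BY user_id
    """
)

totals = dict(conn.execute("SELECT user_id, total FROM user_totals"))
print(totals)  # {1: 15.0, 2: 7.5}
```

The transformation runs where the data lives, which is the point of ELT: the database engine, not the pipeline code, does the heavy lifting.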
## Requirements

- Python 3.7 – 3.12
- PySpark 2.3.x – 3.5.x (depends on the connector used)
- Java 8+ (required by Spark, see below)
- Kerberos libs & GCC (required by the Hive, HDFS and SparkHDFS connectors)
## Supported storages

| Type | Storage | Powered by |
|---|---|---|
| Database | Clickhouse | Apache Spark JDBC Data Source |
| | MSSQL | Apache Spark JDBC Data Source |
| | MySQL | Apache Spark JDBC Data Source |
| | Postgres | Apache Spark JDBC Data Source |
| | Oracle | Apache Spark JDBC Data Source |
| | Teradata | Apache Spark JDBC Data Source |
| | Hive | Apache Spark Hive integration |
| | Kafka | Apache Spark Kafka integration |
| | Greenplum | VMware Greenplum Spark connector |
| | MongoDB | MongoDB Spark connector |
| File | HDFS | HDFS Python client |
| | S3 | minio client |
| | SFTP | Paramiko library |
| | FTP | FTPUtil library |
| | FTPS | FTPUtil library |
| | WebDAV | WebdavClient3 library |
| | Samba | pysmb library |
| Files as DataFrame | SparkLocalFS | Apache Spark File Data Source |
| | SparkHDFS | Apache Spark File Data Source |
| | SparkS3 | Hadoop AWS library |