FileDF Reader#
- class onetl.file.file_df_reader.file_df_reader.FileDFReader(*, connection: BaseFileDFConnection, format: BaseReadableFileFormat, source_path: PurePathProtocol | None = None, df_schema: StructType | None = None, options: FileDFReaderOptions = FileDFReaderOptions(recursive=None))#
Reads files from a source path using the specified file connection and parameters, and returns a Spark DataFrame.
Warning
This class does not support read strategies.
- Parameters:
    - connection : BaseFileDFConnection
      File DataFrame connection. See File DataFrame Connections section.
    - format : BaseReadableFileFormat
      File format to read.
    - source_path : os.PathLike or str, optional, default: None
      Directory path to read data from. Can be None, but only if file paths are passed directly to the run method.
    - df_schema : pyspark.sql.types.StructType, optional, default: None
      Spark DataFrame schema.
    - options : FileDFReaderOptions, optional
      Common reading options.
Examples
Create a reader to parse CSV files in the local filesystem:

```python
from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=CSV(delimiter=","),
    source_path="/path/to/directory",
)
```
All supported options:

```python
from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
    source_path="/path/to/directory",
    options=FileDFReader.Options(recursive=False),
)
```
- run(files: Iterable[str | os.PathLike] | None = None) DataFrame #
Reads files and returns them as a Spark DataFrame.
- Parameters:
    - files : Iterable[str | os.PathLike] | None, default: None
      File list to read. If empty, read files from source_path.
- Returns:
    - df : pyspark.sql.DataFrame
      Spark DataFrame
Examples
Read CSV files from directory /path:

```python
from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
    source_path="/path",
)

df = reader.run()
```
Read some CSV files using file paths:

```python
from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
)

df = reader.run(
    [
        "/path/file1.csv",
        "/path/nested/file2.csv",
    ]
)
```
Read only specific CSV files in a directory:

```python
from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
    source_path="/path",
)

df = reader.run(
    [
        # file paths could be relative
        "/path/file1.csv",
        "/path/nested/file2.csv",
    ]
)
```
from onetl.connection import SparkLocalFS from onetl.file import FileDFReader from onetl.file.format import CSV csv = CSV(delimiter=",") local_fs = SparkLocalFS(spark=spark) reader = FileDFReader( connection=local_fs, format=csv, source_path="/path", ) df = reader.run( [ # file paths could be relative "/path/file1.csv", "/path/nested/file2.csv", ] )