Writing to Greenplum using DBWriter#

For writing data to Greenplum, use DBWriter with GreenplumWriteOptions.

Warning

Please take into account the Greenplum <-> Spark type mapping.

Warning

It is always recommended to create the table explicitly using Greenplum.execute instead of relying on Spark’s table DDL generation.

This is because Spark’s DDL generator can create columns with types different from those expected.
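For example, the table could be created up front with an explicit DDL. This is only a sketch: the column names, types and distribution key below are illustrative assumptions, not taken from any real schema.

```python
# Hypothetical DDL: column names, types and the distribution key
# are placeholders — adjust them to your actual dataframe schema.
ddl = """
CREATE TABLE schema.table (
    id BIGINT,
    business_dt TIMESTAMP,
    value NUMERIC(38, 18)
)
DISTRIBUTED BY (id)
"""

# greenplum = Greenplum(...)   # same connection object as in the examples below
# greenplum.execute(ddl)       # table is created with exactly these column types
```

With the table already in place, if_exists="append" in WriteOptions simply inserts rows without touching the DDL.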

Examples#

from onetl.connection import Greenplum
from onetl.db import DBWriter

greenplum = Greenplum(...)

df = ...  # data is here

writer = DBWriter(
    connection=greenplum,
    target="schema.table",
    options=Greenplum.WriteOptions(
        if_exists="append",
        # by default distribution is random
        distributedBy="id",
        # partitionBy is not supported
    ),
)

writer.run(df)

Interaction schema#

The high-level schema is described in Prerequisites. A detailed interaction schema is shown below.

Spark <-> Greenplum interaction during DBWriter.run()

Options#

pydantic model onetl.connection.db_connection.greenplum.options.GreenplumWriteOptions#

VMware’s Greenplum Spark connector writing options.

Note

You can pass any value supported by the connector, even if it is not mentioned in this documentation.

The set of supported options depends on the connector version. See the link above.

Warning

Some options, like url, dbtable, server.*, pool.* and so on, are populated from connection attributes and cannot be overridden by the user in WriteOptions, to avoid issues.

Examples

Write options initialization

options = Greenplum.WriteOptions(
    if_exists="append",
    truncate="false",
    distributedBy="mycolumn",
)
field if_exists: GreenplumTableExistBehavior = GreenplumTableExistBehavior.APPEND (alias 'mode')#

Behavior of writing data into existing table.

Possible values:
  • append (default)

    Adds new rows into existing table.

    Behavior in details
    • Table does not exist

      Table is created using options provided by user (distributedBy and others).

    • Table exists

      Data is appended to a table. Table has the same DDL as before writing data.

      Warning

      This mode does not check whether the table already contains rows from the dataframe, so duplicated rows can be created.

      Also, Spark does not support passing custom options to the INSERT statement, like ON CONFLICT, so don’t try to implement deduplication using unique indexes or constraints.

      Instead, write to a staging table and perform deduplication using the execute method.

  • replace_entire_table

    Table is dropped and then created.

    Behavior in details
    • Table does not exist

      Table is created using options provided by user (distributedBy and others).

    • Table exists

      Table content is replaced with dataframe content.

      After writing is completed, the target table either keeps the same DDL as before writing data (truncate=True) or is recreated (truncate=False).

  • ignore

    Ignores the write operation if the table already exists.

    Behavior in details
    • Table does not exist

      Table is created using options provided by user (distributedBy and others).

    • Table exists

      The write operation is ignored, and no data is written to the table.

  • error

    Raises an error if the table already exists.

    Behavior in details
    • Table does not exist

      Table is created using options provided by user (distributedBy and others).

    • Table exists

      An error is raised, and no data is written to the table.
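The staging-table deduplication recommended for append mode above can be sketched as follows. The table names, the id join key and the SQL are illustrative assumptions; only DBWriter.run and Greenplum.execute are part of the documented API.

```python
# Hypothetical names — use your real staging/target tables and unique key.
staging = "schema.table_stg"
target = "schema.table"

# Insert only rows whose key is not yet present in the target table.
dedup_sql = f"""
INSERT INTO {target}
SELECT s.*
FROM {staging} AS s
LEFT JOIN {target} AS t ON t.id = s.id
WHERE t.id IS NULL
"""

# writer = DBWriter(connection=greenplum, target=staging, options=...)
# writer.run(df)                                   # 1. load raw data into staging
# greenplum.execute(dedup_sql)                     # 2. move only new rows to target
# greenplum.execute(f"TRUNCATE TABLE {staging}")   # 3. clean up staging
```

Since the connector cannot add ON CONFLICT clauses to its INSERTs, pushing the deduplication into a separate SQL statement executed on the Greenplum side is the usual workaround.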