Prerequisites#

Version Compatibility#

  • MongoDB server versions: 4.0 or higher

  • Spark versions: 3.2.x - 3.4.x

  • Scala versions: 2.12 - 2.13

  • Java versions: 8 - 20

See the official documentation.

Installing PySpark#

To use the MongoDB connector, you should have PySpark installed (or injected into sys.path) BEFORE creating the connector instance.

See the Spark installation instructions for more details.
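As a quick sanity check, the sketch below (an illustrative helper, not part of the connector API) verifies that PySpark is importable from the current sys.path before you create a connection:

```python
import importlib.util

def pyspark_available() -> bool:
    """Return True if PySpark can be imported from the current sys.path."""
    return importlib.util.find_spec("pyspark") is not None

# Fail fast with a helpful message instead of a late ImportError
if not pyspark_available():
    print("PySpark not found; install it first, e.g. `pip install pyspark`")
```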

Connecting to MongoDB#

Connection host#

You can connect to a MongoDB host using either its DNS name or its IP address.
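For example, a minimal connection sketch (assuming the `onetl.connection` import path and an existing `spark` session; adjust names to your setup):

```python
from onetl.connection import MongoDB

# Connect by DNS name; an IP address like "192.168.1.14" works the same way
mongo = MongoDB(
    host="mongodb.domain.com",
    user="user",
    password="*****",
    database="target_database",
    spark=spark,
)
```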

It is also possible to connect to a MongoDB sharded cluster:

mongo = MongoDB(
    host="master.host.or.ip",
    user="user",
    password="*****",
    database="target_database",
    spark=spark,
    extra={
        # read data from secondary cluster node, switch to primary if not available
        "readPreference": "secondaryPreferred",
    },
)

Supported readPreference values are described in the official documentation.

Connection port#

Connections are usually made to port 27017, but the port may differ between MongoDB instances. Ask your MongoDB administrator for the correct value.
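For illustration, a hypothetical helper showing how host, port, and database combine into a standard `mongodb://` connection URI (the connector builds this internally; the helper name is made up):

```python
def mongodb_uri(host: str, port: int = 27017, database: str = "admin") -> str:
    """Build a standard mongodb:// connection URI (illustrative helper)."""
    return f"mongodb://{host}:{port}/{database}"

print(mongodb_uri("mongodb.domain.com", database="target_database"))
# mongodb://mongodb.domain.com:27017/target_database
```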

Required grants#

Ask your MongoDB cluster administrator to grant the following roles to the user used for creating the connection:

// allow writing data to specific database
db.grantRolesToUser("username", [{db: "somedb", role: "readWrite"}])
See: