Databricks – Learnings from ELT Pipeline, Tips N Tricks

In this article I would like to share a few nifty tricks I came across, with a view to helping anyone in a similar situation.

How do I push data from a Spark DataFrame to an MSSQL Server table?
For this I would recommend using the following:

Apache Spark Connector for SQL Server and Azure SQL (Release)

This is a seamless connection mechanism designed for Spark-to-MSSQL connectivity, rather than relying on pyodbc. Using this connector, developing Python scripts for data movement in a Spark environment becomes quite easy and consistent with other connection methods:

# Write the DataFrame to SQL Server via the Apache Spark Connector
(df.write.mode('overwrite')
 .format('com.microsoft.sqlserver.jdbc.spark')
 .option("url", 'jdbc:sqlserver://hostname;databaseName=dbname;encrypt=true;trustServerCertificate=true;')
 .option("dbtable", 'dbo.tablename')
 .option("user", 'sqluser')
 .option("password", 'sqlpwd')
 .save())

How would I set my cluster time zone so that I don’t have to set it manually in each session?
While configuring your cluster, you can add this Spark config in the lower section of the setup page, so that each time the cluster comes up the setting is applied to all executions:

spark.sql.session.timeZone Australia/Sydney
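
For comparison, this is the per-session equivalent that the cluster-level config saves you from re-running in every notebook; a minimal sketch:

# Set the time zone for the current session only (what the cluster config otherwise does for you)
spark.conf.set("spark.sql.session.timeZone", "Australia/Sydney")

# Verify: current_timestamp() now reflects the session time zone
spark.sql("SELECT current_timestamp() AS now_sydney").show(truncate=False)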

How would I run a custom SQL statement or stored procedure in MSSQL without using pyodbc?
There is an embedded connection available in Databricks through which one can execute statements of choice against the respective SQL Server:

# Use the JVM's java.sql.DriverManager exposed through the Spark gateway
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
con = driver_manager.getConnection('jdbc:sqlserver://hostname;databaseName=dbname;encrypt=true;trustServerCertificate=true;', 'sqluser', 'sqlpwd')

updsql = """
UPDATE dbo.tablename
SET ModifiedDateTime = GETDATE();
"""

# Execute the statement and release resources
stmt = con.createStatement()
stmt.execute(updsql)
stmt.close()
con.close()

This way, one does not need to import pyodbc explicitly to run custom statements back against the SQL Server.
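
Since the question also covers procedures, below is a minimal sketch of invoking a stored procedure over the same connection pattern; dbo.usp_RefreshStaging is a hypothetical procedure name used only for illustration.

# Reuse the gateway DriverManager connection to call a stored procedure
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
con = driver_manager.getConnection('jdbc:sqlserver://hostname;databaseName=dbname;encrypt=true;trustServerCertificate=true;', 'sqluser', 'sqlpwd')

stmt = con.createStatement()
stmt.execute("EXEC dbo.usp_RefreshStaging")  # hypothetical procedure; any valid T-SQL batch works here
stmt.close()
con.close()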

I am getting the error “Can’t get JDBC type for void.” How can I correct it?
This error occurs when you try to write a Spark DataFrame to SQL Server as a table from a Python notebook. According to community articles, it arises because the JDBC driver talking to SQL Server cannot map Spark’s untyped (void) NULL columns; a NULL has to be tied to a concrete type such as an int NULL or varchar NULL before it can be written.

To correct this, cast any NULL literal columns to an explicit data type before writing, as sketched below.
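
A minimal sketch of the fix; LegacyCode is a hypothetical column name used only for illustration.

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

# Problematic: lit(None) on its own yields a NullType (void) column the JDBC writer cannot map
# df = df.withColumn("LegacyCode", lit(None))

# Fix: give the NULL literal an explicit type so JDBC knows the target SQL type
df = df.withColumn("LegacyCode", lit(None).cast(StringType()))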

I am getting the error “‘DataFrame’ object has no attribute ‘merge’” when trying to update a Delta table. How can I fix it?
During the development of an ELT pipeline, there are instances where one has to update records in a Delta table from a recently extracted DataFrame, and the following error can occur:

In order to resolve this error, the merge must be run against a DeltaTable object rather than a plain Spark DataFrame; amend your target reference as sketched below.
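
A minimal sketch of the amended target, assuming a hypothetical target table name, source DataFrame (source_df) and join key (Id); substitute your own.

from delta.tables import DeltaTable

# merge() lives on DeltaTable, not on a plain Spark DataFrame
target = DeltaTable.forName(spark, "schema_name.target_table")  # hypothetical table name

(target.alias("t")
 .merge(source_df.alias("s"), "t.Id = s.Id")  # source_df and the Id join key are assumptions
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())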

And that’s it for this post.
I hope the above helps as a quick reference for anyone facing similar challenges.
