Recently our company embarked on a Journey to upgrade to Dynamics 365 F&O for their ERP operations. This resulted into chain of integration development of existing operational applications to hook up to the new D365 system via their OData leveraging Azure functions as intermediary to encapsulate the business logic.
Now the above model works for the applications with works with transactional data activities, like recording stocktake, PO updates or singular product information.
But when the requirements followed on to integrate the Dynamics 365 with a Data warehouse implemented in Databricks we came across few challenges in the beginning
- Export to Data Lake: This is a good feature to leverage which is like a CDC (Change Data Capture) inside the D365, hooked up to generate CDM (Common Data Model) in the Azure Data Lake Gen 2 configured in settings.
The drawback of this option I found was that its driven only by the tables being exported to data lake (not entities like a view on raw tables), hence the certain join logic were required to be deciphered
Note: Option “Choose using entities” is a convivence option to auto select the tables involved in a entity composition but still the option will export data at table level only.
- OData API: These are the perfect candidates for us to obtain the data from D365 as entities being exposed as OData Rest API. This options allows us to do edit the entities CRUD.
The drawback of this option came across when we started hitting entities with million plus data points which wasn’t incremental data, but dimensional. Now recurring download of million plus dataset via OData even with batch options was not going to complete in the near real time threshold of 10~15 mins refresh cycle.
Data Event Catalog to the rescue !!!
The above options was a time saver and a great quick win for us.
Now we can configure the Entity object change events to an Azure Event Grid and thus capturing the change events in a Azure Data Lake (Queue) real time as JSON messages.
And now for the initial load of the tables we have one time scheduled notebook (Python) in our Databricks setup where the initial data import (slow) is imported and persisted in as delta lake table (bronze) and post which we have incremental load notebooks which reads off this queue messages interpret them and as interpreted the selective data points are imported and updated in those large dimensional entities. Substantially improving our incremental runtimes and providing us the required outcome.
Post this process, our dbt pipe line kicks off, this running the transformational import from bronze to silver and to gold with the relevant test cases and documentation being generated and published automated via Databricks scheduled job.
The process overview now looks as follows for the incremental load now:
Happy to receive comments on the same for any improvement suggestions from the community.
Hope the above helps.