Hi,
This post, is more about the design approach being adopted towards enterprise data warehousing recently.
So nothing much technical this time…
In recent days I came across the term “Data Vault” and I was keen to know more about it, hence i ventured into the territory of the same. And due to very limited re-sources available online on the topic has prompted me to write this article to shed more light into the same.
So back to business:
What is Data Vault: “Data Vault Modeling is a database modeling method that is designed to provide long-term historical storage of data coming in from multiple operational systems. It is also a method of looking at historical data that, apart from the modeling aspect, deals with issues such as auditing, tracing of data, loading speed and resilience to change.”
from : Wiki
As I have faced the problem of understanding the concept, just by reading the wiki wasn’t enough. So I will try to explain it in my own words.
(Excuse me !!! If I deviate in my words from the original ground work laid for the same)
Description:
Before progressing on further in this article I assume that you have read the wiki article about data vault and know the rules behind the Hubs, Links and Satellites, as further I will be building on top of them in the post.
From my experience in the past building an data warehouse on a green field project is always challenging, and is prone to many cycles of re-development. As data from multiple sources start integrating into one data warehouse, the key problems start to emerge are duplication of similar business data across different tables from multiple sources, multiple sources providing different set of attributes for similar business data, incrementally incorporating new data and more.
Example: Lets take an example of Stock Tickers
In any finance database if you maintain list of tickers, the sources can be Yahoo, Nasdaq, and many more stock exchanges across the world. Though the entity is one termed as “Ticker” but multiple sources will provide different set of data associated with the tickers, some more some less..
In the above image as you can see, the Yahoo doesn’t provide the data for Annualized Dividend, Yield and where as for Nasdaq source doesn’t provide Market-Cap.
Now later suppose new sources bring additional data, say for example google finance would bring me “Volume” for the tickers then with the above I would have to add new column to the table for the same. Which in a way is not that difficult nor it would disrupt/break the existing warehouse but yes the table structure will change.
And once table structures are changed (lucky if you don’t have schema-binded views/stored procedures else you wont be able to change your table structure also) you are in a question if other interacting systems are impacted or not who haven’t been using explicit column names in their operations.
Thus to me data vault is a technique of data warehousing in which the following steps play a key role for above similar situations
1. Identification of Hubs: It is a key process in building of data vault to first identify your hubs in your warehouse. These will form the pillars of the warehouse upon which you will establish upon your relationships representing the business.
2. Creation of Links: These tables with a good naming convention will clearly establish the relationship across your multiple hub’s, thus providing the logic.
3. Creation of Satellites: All the attributes further defining your hubs can be places in satellites. One hub can have multiple satellites.
And from the above we are able to achieve a relatively flexible architecture where new information is easy to incorporate, with the price of extra joins for information (but even those can be addressed by the smaller data marts emanating from your enterprise data warehouse) or with SuperNova models.
(seems with the names of data warehouse modelling we are going beyond stars…)
Well in summary as you may have experienced in this article, I have just scratched the surface of the topic and the topics spans much beyond it. But re-writing the same concepts is not I am aiming, hence I am providing you some good reads for the same which helped me understand the concept more thoroughly and leverage it at my work.
Good Reads
Good Read for Concept
Data Vault Wiki
Good Article for applying the concept to work, with advantages and dis-advantages
What is Data Vault ?
A very good description of data vault and its extension with super-nova models with cached views will give a great insight on the application of the concept and its usage in the real world.
Data Vault and Data Virtualization: Double Agility
Generate Data Vault by Tool ??
And lastly, in the note of the above technique. Many people are trying to find algorithmic ways to implement the data-vault based on the referential constraint in their existing transactional systems or legacy warehouses.
I have reviewed few algorithmic pseudo-code behind these, but to me nothing can replace the human designer for the same, as at many instances the referential integreties (or overdone) are incomplete and the logic interprets the schema to in-correct table type in data vault context.
So in summary my advise is to better design the data vault manually as it would yield more beneficial result and leveraging tools for automating the repetitive processes but the design/categorization of Hubs,Links and Satellites are better done with human designers rather than relying on algorithms based on keys and referential integreties.
Hope it helps.