Data vault modeling

The data vault modeling approach was invented by Dan Linstedt in 1990, and is used to model the enterprise data warehouse core layer. This approach is suitable for multi-source environments needing a fast adaptation to changes.

Curved lines on a blue background.

Modeling strategy

For the data vault approach, it is essential to identify the core business concepts and processes used for hubs and links. While the interpretation of the source systems is possible, the source system may not provide the truth about business reality. To ensure the longevity of the enterprise data warehouse, and reduce reliance on the source systems, the core of the model must be independent from the sources.

Hubs

Hub structures contain core business concepts for functional areas, and each hub contains a business key and a surrogate key for the internal data warehouse. A business key is a company-wide identifier that a company uses to separate the singletons for individual business concept, while a surrogate key is used for reference in the data vault model. Transactions and events must be modeled as independent hubs.

Links

Links are the associations of the hub surrogate keys. There is no limit to the number of referenced hubs as long as the association is a business relationship. The physical entity contains the various surrogate keys of the linked hubs, each in a separate column containing surrogate keys for associations. The concept of links offers many possibilities since the structures are independent from one another, and it can help define new relationships or meet new requirements. Furthermore, a data vault model allows you to create links without affecting the rest of the model.

Satellites

Satellites contain all the information that describes business keys or associations, and a hub or a link can have many satellites. The idea is to group attributes by source, or by change frequency. Each satellite entity contains a surrogate key associated with a hub's business key or a link. The load date is part of the satellite key, therefore each time an attribute change is loaded, a new row is inserted containing the loading date of the satellite insert.

Data integration and ETL

Integrating data from multiple source systems into a single analytical data solution has always been a daunting task. The strategy for data vaults is to find core business concepts, and natural business keys in source systems and ensure they match the business perspective. The same approach is used for relationships.

The flexibility of the data vault model derives from its ability to add objects without completely redesigning the existing structure, and its ability to use different satellites for the source system. This strategy allows tracking differences in the core data warehouse, but offloads the choice of where to source attributes to the data mart layer. For example, if you need to load a new source system into your data warehouse that also contains existing entities, you can add new satellites with different attributes to your model. A brief report comparing the two satellites would reveal discrepancies. For quality issues during an integration, a quality satellite can be added with the details of the integration issues.

The ETL jobs to load a data vault typically run in two steps:

  1. All hubs are loaded in parallel.
  2. All links and satellites are loaded in parallel.

The ETL operations for each type of objects are:

  • Hubs: the business key must appear only once in the Hub, and insert with a lookup on the business key.
  • Links: the association must appear only once in the link, and insert with a lookup on the association of surrogate key.
  • Satellites: the loading is the most complex, depending on the historization mode, either the changed attributes are updated, or a new version must be inserted.

Every object in the data vault approach stores two auditing attributes that provide highly detailed audit information:

  1. The first load date, and
  2. the first load source

Historization

In the data vault model, historization resides in the satellites, and it looks like Slowly Changing Dimensions (SCD), except the satellite's primary key is the hub's surrogate key and the load date of the new record. This historization approach is used by including facts, which are hubs with satellites.

As explained above, it is possible to have multiple satellites for each hub or link, grouped by the frequency of change. This multi-satellites strategy can be complicated when multiple satellites are historized, because of the regenerating of time slices within multiple historizations. In this scenario, a possible solution is called a "point-in-time" table, where a structure is placed between the hub and links, and the historized satellite stores the time slice and the keys. This means, each time there is a change in one of the satellites, a new record is created in the point-in-time table, linking each satellite (surrogate key for each satellite and the load date for each satellite) to the current time slice record.

Suggested reading

What is data modeling?

Smart data automation for data vault modeling

Future-proof your data with biGENIUS-X today.

Accelerate and automate your analytical data workflow with comprehensive features that biGENIUS-X offers.