because no master data management solution has been implemented upstream of the data
warehouse. Therefore, to put together what appears to be “one version of the customer
record” and not double or triple count, algorithms are applied to bridge the keys together.


In the data vault landscape, we call this a hierarchical or same-as link: hierarchical if
it represents a multilevel hierarchy, and same-as if it is a single-level hierarchy (a
parent-to-child remap) of terms.
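
As a rough illustration of how a same-as link bridges duplicate keys, the sketch below remaps each observed customer key to a master key before counting. The key values, the (master, duplicate) row layout, and the function name are all hypothetical; a real same-as link would be a table populated by a matching algorithm.

```python
# Hypothetical same-as link rows: (master_key, duplicate_key).
# The key values are invented for illustration only.
same_as_link = [
    ("CUST-001", "CUST-001"),
    ("CUST-001", "CUST-417"),
    ("CUST-001", "CUST-982"),
    ("CUST-002", "CUST-002"),
]


def count_distinct_customers(keys, link_rows):
    """Count customers after remapping each key to its master key."""
    to_master = {duplicate: master for master, duplicate in link_rows}
    # Keys with no same-as entry are treated as their own master.
    return len({to_master.get(key, key) for key in keys})


observed_keys = ["CUST-001", "CUST-417", "CUST-982", "CUST-002"]
naive_count = len(set(observed_keys))  # counts every key separately
bridged_count = count_distinct_customers(observed_keys, same_as_link)
```

Here the naive count treats four keys as four customers, while the bridged count collapses the three keys that the link maps to the same master.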


Placing these sequence numbers as business keys in hubs has the following issues:



  • They are meaningless—a human cannot determine what the key stands for (contextually) without
    examining the details for a moment in time.

  • They can change, and often do, even with something as “simple” as a source system upgrade. This
    results in a serious loss of traceability to the historical artifacts. Without an “old-key” to “new-key”
    map, there is no definitive traceability.

  • They can collide. Even though conceptually across the business there is one element called “customer
    account,” the same ID sequence may be assigned in different instances for different customer accounts.
    In this case, they should never be combined. An example of this would be two different implementations
    of SAP: one in Japan and one in Canada. Each assigns customer ID #1; however, in Japan's system, #1
    represents “Joe Johnson,” whereas in Canada's system, #1 represents “Margarite Smith.” The last thing
    you want in analytics is to “combine” these two records for reporting just because they have the same
    surrogate ID.
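
The collision described in the last bullet can be reproduced with a small sketch. The row layout and the loader function are assumptions made for illustration; only the customer names and ID #1 come from the example above.

```python
# Hypothetical extracts from two separate SAP instances.
japan_rows = [{"customer_id": 1, "name": "Joe Johnson"}]
canada_rows = [{"customer_id": 1, "name": "Margarite Smith"}]


def load_by_id_only(*extracts):
    """Naive load keyed on the surrogate ID alone; later rows overwrite earlier ones."""
    hub = {}
    for rows in extracts:
        for row in rows:
            hub[row["customer_id"]] = row["name"]
    return hub


merged = load_by_id_only(japan_rows, canada_rows)
```

Two distinct customers go in, but only one entry survives under ID 1, which is exactly the unwanted "combination" for reporting.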


An additional question arises if the choice is made to utilize data vault sequence numbers
for hubs and the source system business keys are surrogates. The question is as follows:
why “rekey” or “renumber” the original business key? Why not just use the original
business key (which by the way is how the original hub is defined)?


To stop the collision (as put forward in the example above)—whether a surrogate
sequence, a hash key, or the source business key is chosen for the hub structure—another
element must be added. This secondary element ensures uniqueness of this surrogate
business key. One of the best practices here is to assign geography codes, for example,
JAP for any customer account IDs that originate from Japan's SAP instance and CAN for
any customer account IDs that originate from Canada's SAP instance.
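
If a hash key is the chosen hub key, one way to apply the geography code is to hash it together with the source ID. This is a minimal sketch, not a prescribed method: the choice of MD5, the "|" delimiter, the trimming and upper-casing, and the function name are all assumptions.

```python
import hashlib


def hub_hash_key(geo_code: str, customer_id: str, delimiter: str = "|") -> str:
    """Hash the geography code and source ID together.

    MD5, the delimiter, and the normalization steps are assumptions
    made for this sketch.
    """
    business_key = delimiter.join([geo_code.strip().upper(), customer_id.strip()])
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()


japan_key = hub_hash_key("JAP", "1")   # customer ID #1 from Japan's instance
canada_key = hub_hash_key("CAN", "1")  # customer ID #1 from Canada's instance
```

Because the geography code is part of the hashed input, the two customers with ID #1 receive different hub keys.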


Multipart Source Business Keys


Using a geographic code, as mentioned above, brings up another issue. If the hub is
created based solely on source system business key (and not surrogate sequence or hash
key), then with the choice above (to add a geography code split), the model must be
designed and built with a multipart business key.
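
A multipart business key can be pictured as a hub whose key has two parts rather than one. The sketch below is illustrative only: the class, field, and function names and the record-source strings are hypothetical, and an in-memory dictionary stands in for the hub table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerAccountKey:
    """Two-part business key: geography code plus source customer ID."""
    geo_code: str
    customer_id: str


def register(hub, geo_code, customer_id, record_source):
    """Add the key to the hub if it is new; keep the first-seen record source."""
    key = CustomerAccountKey(geo_code, customer_id)
    hub.setdefault(key, record_source)
    return key


hub_customer_account = {}
register(hub_customer_account, "JAP", "1", "SAP-JAPAN")
register(hub_customer_account, "CAN", "1", "SAP-CANADA")
register(hub_customer_account, "JAP", "1", "SAP-JAPAN")  # repeat load, no new row
```

Every lookup and every join against this hub now has to supply both parts of the key, which is the modeling consequence the paragraph above points to.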


Chapter 6.2: Introduction to Data Vault Modeling