child tables? Or relationships that are multiple levels deep? Then the problem compounds,
because the length of the load cycles grows exponentially.



To be fair, let's now address some of the benefits of using sequence numbers.
Once established, sequence numbers have the following positive impacts:



  • Small byte size (generally less than NUMBER(38), that is, 38 “9’s”, or 10^125).

  • Process benefit: joins across tables can leverage small byte size comparisons.

  • Process benefit: joins can leverage numeric comparisons (faster than character or binary
    comparisons).

  • Always unique for each new record inserted.

  • Some engines can further partition (group) the numerical sequences in ascending order and apply
    subpartition (micropartition) pruning through range selection during the join process (in parallel).
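
The range pruning mentioned in the last bullet can be illustrated with a small Python sketch. It is not how any particular engine implements pruning; the Partition class, the build_partitions and prune helpers, and the partition size are hypothetical names and values chosen only for illustration. The idea is that when sequence numbers are stored in ascending order, each partition covers a narrow key range, so a range predicate can skip most partitions without reading them.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    min_key: int   # smallest sequence number stored in this partition
    max_key: int   # largest sequence number stored in this partition
    keys: list     # the sequence numbers themselves


def build_partitions(keys, size):
    """Group ascending sequence numbers into fixed-size partitions."""
    ordered = sorted(keys)
    partitions = []
    for start in range(0, len(ordered), size):
        chunk = ordered[start:start + size]
        partitions.append(Partition(chunk[0], chunk[-1], chunk))
    return partitions


def prune(partitions, low, high):
    """Keep only partitions whose key range overlaps [low, high]."""
    return [p for p in partitions if p.max_key >= low and p.min_key <= high]


# Sequence numbers 1..100 split into four partitions of 25 keys each.
partitions = build_partitions(range(1, 101), size=25)

# A join restricted to keys 30..40 only needs the partition covering 26..50.
survivors = prune(partitions, low=30, high=40)
print(len(partitions), len(survivors))  # 4 1
```

Each surviving partition could then be scanned by a separate worker, which is where the parallelism in the bullet comes from.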


Hash Keys


What is a hash key? A hash key is a business key (possibly composed of several fields) run
through a computational function called a hash; the result is then assigned as the primary
key of the table. Hash functions are deterministic: given input X, the hash function
produces output Y every single time it is provided X (the same input always generates the
same output). Definitions of hash functions, what they are and how they work, can be found
on Wikipedia.
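
As a minimal sketch of the determinism described above, the following Python snippet derives a hash key from a composite business key. The function name hash_key, the "|" delimiter, the trim-and-uppercase normalization, and the choice of SHA-256 are all assumptions made for illustration; the text does not prescribe any of them.

```python
import hashlib


def hash_key(*business_key_parts: str, delimiter: str = "|") -> str:
    """Return a hex digest for a (possibly composite) business key."""
    # Normalize each part so that spacing or letter case does not change the key.
    # This normalization rule is an assumption, not something the text requires.
    normalized = [part.strip().upper() for part in business_key_parts]
    # The delimiter keeps ("AB", "C") distinct from ("A", "BC").
    joined = delimiter.join(normalized)
    # SHA-256 is an assumed choice of hash function.
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


first = hash_key("ACME", "US-1001")
second = hash_key("ACME", "US-1001")
print(first == second)  # True: same input, same output
```

Because the output depends only on the input, any system that applies the same function and the same normalization rules arrives at the same key without consulting a central key generator.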


Hash key benefits to any data model:



  • 100% parallel independent load processes (if referential integrity is shut off) even if these load
    processes are split on multiple platforms or multiple locations.

  • Lazy joins, that is, the ability to join across multiple platforms utilizing technology like Drill (or
    something similar), even without referential integrity. Note that lazy joins on sequences can’t be
    accomplished across heterogeneous platform environments, and sequences aren’t even supported in some
    NoSQL engines.

  • Single field primary key attribute (same benefit here as the sequence numbering solution).

  • Deterministic: it can even be precomputed on the source systems or at the edge for IoT devices/edge
    computing.

  • Can represent unstructured and multistructured data sets: based on specific input, hash keys can be
    calculated again and again (in parallel). In other words, a hash key can be constructed as a business key
    for audio, images, video, and documents. This is something sequences cannot do in a deterministic
    fashion.

  • If there is a desire to build a smart hash function, then meaning can be assigned to bits of the hash
    (similar to Teradata and what it computes for the underlying storage and data access).
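
The first and fifth benefits in the list above can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed implementation: the loader functions, the field names (customer_no, order_no), the sample rows, and the use of SHA-256 are all hypothetical. The point is that the child load computes its parent's key directly from the business key in its own source rows, so it never has to wait for, or look up against, the parent load.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_key(business_key: str) -> str:
    # SHA-256 and the trim/uppercase normalization are assumptions.
    return hashlib.sha256(business_key.strip().upper().encode("utf-8")).hexdigest()


def load_customers(rows):
    # Parent load: the primary key comes from the business key alone.
    return {hash_key(row["customer_no"]): row for row in rows}


def load_orders(rows):
    # Child load: the parent key is recomputed, not looked up,
    # so this load has no dependency on load_customers.
    return [
        {
            "order_hk": hash_key(row["order_no"]),
            "customer_hk": hash_key(row["customer_no"]),
            **row,
        }
        for row in rows
    ]


def content_key(payload: bytes) -> str:
    # Unstructured input (image, audio, or document bytes) is keyed the same way.
    return hashlib.sha256(payload).hexdigest()


customers_src = [{"customer_no": "C-100", "name": "Acme"}]
orders_src = [{"order_no": "O-1", "customer_no": "C-100"}]

# Both loads are submitted at once; neither consults the other's output.
with ThreadPoolExecutor(max_workers=2) as pool:
    customers_future = pool.submit(load_customers, customers_src)
    orders_future = pool.submit(load_orders, orders_src)
    customers = customers_future.result()
    orders = orders_future.result()

# The keys still line up after the fact.
all_resolved = all(order["customer_hk"] in customers for order in orders)
print(all_resolved)  # True
```

With sequence numbers, load_orders would first have to look up the sequence assigned to customer C-100, which forces the parent load to finish first. Here the two loads could just as well run on different platforms or in different locations.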


Hash keys are important to Data Vault 2.0 because of the efforts to connect
heterogeneous data environments such as Hadoop and Oracle. Hash keys are also


Chapter 6.2: Introduction to Data Vault Modeling