Open Source For You — December 2017

How To Admin


The management of Big Data is crucial if enterprises are to benefit from the huge
volumes of data they generate each day. Hive is a tool built on top of Hadoop that
can help to manage this data.

Hive: The SQL-like Data Warehouse Tool for Big Data


Hive is a data warehouse infrastructure tool to process
structured data in Hadoop. It resides on top of Hadoop
to summarise Big Data, and makes querying and
analysing easy.
A little history about Apache Hive will help you
understand why it came into existence. When Facebook
started gathering data and ingesting it into Hadoop, the data
was coming in at the rate of tens of GB per day back in 2006.
Then, in 2007, it grew to 1TB/day and within a few years
increased to around 15TB/day. Initially, Python scripts were
written to ingest the data into Oracle databases, but with the
increasing data rate and the growing diversity in the sources
and types of incoming data, this was becoming difficult. The
Oracle instances were filling up fast and it was time to
develop a new kind of system that could handle large amounts
of data. It was Facebook that first built Hive, so that most
people who had SQL skills could use the new system with minimal
changes compared to what they were used to with RDBMSs.
The main features of Hive are:
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying, called
  HiveQL or HQL.
• It is familiar, fast, scalable and extensible.
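To give a feel for the SQL-type, OLAP-style querying mentioned above, here is a minimal sketch. It is an illustration only: SQLite (from Python's standard library) is used as a stand-in engine so that the example runs anywhere, and it is not Hive. The table and column names (ingest_log, source, day, gigabytes) and the sample figures are hypothetical. The SELECT statement itself uses only constructs that HiveQL also supports, so a similar query should work against a Hive table, although the CREATE TABLE statement would use Hive types such as STRING and INT.

```python
import sqlite3

# SQLite is only a stand-in engine so that the sketch runs anywhere; it is not Hive.
# Table and column names (ingest_log, source, day, gigabytes) are hypothetical.
QUERY = """
SELECT source, SUM(gigabytes) AS total_gb
FROM ingest_log
GROUP BY source
ORDER BY total_gb DESC
"""


def summarise(rows):
    """Load (source, day, gigabytes) rows into an in-memory table
    and return the total volume per source, largest first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE ingest_log (source TEXT, day TEXT, gigabytes INTEGER)"
        )
        conn.executemany("INSERT INTO ingest_log VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    sample = [
        ("web", "2017-12-01", 40),
        ("mobile", "2017-12-01", 25),
        ("web", "2017-12-02", 55),
    ]
    for source, total_gb in summarise(sample):
        print(source, total_gb)
```

The query summarises many rows into a few aggregates rather than updating individual records, which is the kind of analytical workload that the OLAP point in the list refers to.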
Hive architecture is shown in Figure 1.
The components of Hive are listed in Table 1.