What is Hive?

Apache Hive is an application that abstracts Hadoop data so that it can be accessed using an SQL-like language called HiveQL. With HiveQL, you can use a familiar declarative language to query large amounts of data on the grid as if you were working with a relational database.

Hive offers a broad range of SQL semantics and integrates with both ODBC and JDBC interfaces, making it ideal for analyzing data.

Why Hive?

Hive is one of the fastest growing products for many reasons:

Accessing Hive

The diagram shows how a query made from the Hive CLI is transmitted to Hive, where it is translated into a MapReduce job that is run by Hadoop. Client applications use the ODBC/JDBC drivers to communicate with HiveServer2, which relays their queries. Like queries from the CLI, these are converted into MapReduce jobs that are executed on Hadoop.

[Figure: Accessing Hive]
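
The two access paths described above can be sketched as command lines. The following Python sketch only builds the argument lists and does not run anything. It assumes the standard `hive -e` CLI option and the Beeline client with a `jdbc:hive2://` URL as the JDBC route to HiveServer2. The host name, table name, and query are placeholders, and 10000 is assumed as the HiveServer2 port, which may differ on your grid.

```python
def hive_cli_command(query):
    """Argument list for running one query through the Hive CLI."""
    return ["hive", "-e", query]


def hiveserver2_jdbc_url(host, port=10000, database="default"):
    """JDBC URL a client would use to reach HiveServer2.

    Port 10000 is an assumed default; check your own deployment.
    """
    return f"jdbc:hive2://{host}:{port}/{database}"


def beeline_command(host, query, port=10000, database="default"):
    """Argument list for running one query through Beeline over JDBC."""
    url = hiveserver2_jdbc_url(host, port, database)
    return ["beeline", "-u", url, "-e", query]


# Placeholder values for illustration only.
QUERY = "SELECT COUNT(*) FROM page_views"
cli_args = hive_cli_command(QUERY)
jdbc_args = beeline_command("hs2.example.com", QUERY)
print(" ".join(cli_args))
print(" ".join(jdbc_args))
```

Either list could be passed to `subprocess.run` on a machine where the corresponding client is installed. In both cases the query text is the same, and only the route to Hive differs.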

Hive vs. Pig

The table below compares Hive and Pig, highlighting when and where each should be used, their respective features, and available support.

Feature | Hive | Pig
--------|------|----
Where to Use | Ad-hoc analytics and reporting | ETL and pipeline data processing
Language | SQL (declarative) | PigLatin (procedural)
Schema/Types | Mandatory (implicit) | Optional (explicit)
Partitions | Yes | No, partition pruning with HCatalog
Complex Processing | Not a good fit for complex processing | Well suited where multi-query works with thousands of lines of Pig script
Client/Server | Requires metastore server (HCatalog) and data registered with it | Client only. Works with HCatalog metastore
ODBC/JDBC | Yes, through HiveServer2 | No
Tez Support | Present and stable from Hive 0.13 | Tez support under development (Pig 0.14)
ORC/Vectorization | ORC and vectorization available | ORC available with Pig 0.14, no vectorization yet
Transactions | Yes (coming soon) | No
Cost-Based Optimization | Yes (coming soon) | No

When to Use Hive vs. HBase

While it is reasonable to compare Pig and Hive, HBase and Hive serve very different purposes in the Hadoop ecosystem. The table below highlights the differences and when you would consider using each.

Feature | Hive | HBase
--------|------|------
Where to Use | Data warehousing and analytics on top of Hadoop/HDFS; querying and analyzing large volumes of data; not suited to frequent or record-level updates (although support for ACID transactions is being added) | Distributed key-value store for persistence and random access on HDFS; built to support tens of thousands of reads/writes per second at the record level; storing and accessing values using keys
Access | Primarily through Hive SQL | Java and REST APIs
SQL | Getting close to SQL standards | Through Hive or Phoenix (SQL skin on HBase) (not supported)
Integration | Integrated with Pig through HCatalog; integrated with Oozie through support for Hive action and HCatalog partition notifications | Integrated with Hive for SQL support; integrated with Pig (HBaseStorage) and Oozie (credential support)
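
The difference in access patterns can be illustrated with a toy model in plain Python. This is not the Hive or HBase API: a dictionary stands in for a key-value store that reads and writes single records by key, and a loop over every row stands in for an analytic query. All names and data are made up for illustration.

```python
# Toy data: (user_id, country, page_views). Invented for illustration.
ROWS = [
    ("u1", "US", 12),
    ("u2", "IN", 7),
    ("u3", "US", 3),
    ("u4", "BR", 9),
]

# HBase-style access: values are stored and fetched by row key.
kv_store = {user_id: {"country": c, "views": v} for user_id, c, v in ROWS}


def get_record(row_key):
    """Random read of one record by key, without touching other rows."""
    return kv_store.get(row_key)


def put_views(row_key, views):
    """Record-level update of a single row."""
    kv_store[row_key]["views"] = views


# Hive-style access: a query scans the data set and aggregates it, roughly
# what "SELECT country, SUM(views) ... GROUP BY country" expresses.
def views_by_country(rows):
    totals = {}
    for _user_id, country, views in rows:
        totals[country] = totals.get(country, 0) + views
    return totals


put_views("u2", 8)
print(get_record("u2"))
print(views_by_country(ROWS))
```

The keyed functions touch one record each, which is the kind of workload the table assigns to HBase. The aggregate reads every row once, which is the kind of workload it assigns to Hive.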

Data Model

Hive data is organized into databases, tables, partitions, and buckets. Those familiar with SQL will recognize databases, which use a namespace to organize a group of tables, and tables, which have a schema defining column data. Partitions allow you to create virtual columns based on keys that determine how data is stored. Users can identify the rows they need by partition and run queries on those partitions instead of across the entire data set. Buckets allow you to split partitions, allowing even more focused queries. Skewed tables, like partitions, allow you to focus queries on a subset of the data set by splitting the data into separate files so that certain files can be skipped when executing a query.

The diagram below gives the general hierarchy of the data model and a general characteristic of each level. See Data Units for more detailed information.

[Figure: Data Model in Hive]
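
How partitions and buckets narrow a query can be sketched in plain Python. This is a simplified model, not Hive's implementation: the table name, the `dt` partition key, the directory and file naming, and the modulo bucketing of an integer key are all assumptions chosen to mirror the description above.

```python
NUM_BUCKETS = 4  # hypothetical bucket count for the example table


def partition_dir(table, dt):
    """Directory holding one partition; the key acts as a virtual column."""
    return f"{table}/dt={dt}"


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Assign an integer key to a bucket (simplified: key modulo bucket count)."""
    return user_id % num_buckets


def files_to_read(table, partitions, dt=None, user_id=None):
    """List the bucket files a query would have to read in this model.

    A filter on dt skips whole partitions; a filter on user_id skips
    every bucket except the one that key is assigned to.
    """
    selected = [p for p in partitions if dt is None or p == dt]
    if user_id is None:
        buckets = list(range(NUM_BUCKETS))
    else:
        buckets = [bucket_for(user_id)]
    return [f"{partition_dir(table, p)}/bucket_{b}" for p in selected for b in buckets]


PARTITIONS = ["2014-06-01", "2014-06-02", "2014-06-03"]
full_scan = files_to_read("page_views", PARTITIONS)
one_day = files_to_read("page_views", PARTITIONS, dt="2014-06-02")
one_user = files_to_read("page_views", PARTITIONS, dt="2014-06-02", user_id=10)
print(len(full_scan), len(one_day), one_user)
```

In this model an unfiltered query reads all twelve files, a filter on the partition key reads four, and adding a filter on the bucketed column reads one.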

Hive and HCatalog

HCatalog, part of the Hive project, is the central metastore for facilitating interoperability among various Hadoop tools. It acts as the table and storage management layer so that Pig, MapReduce, and Hive can share data. It also presents a relational view of the data in HDFS, abstracts where and in what format data is stored, and enables notifications of data availability.

[Figure: Hive and HCatalog]

HiveServer2

HiveServer2 is the JDBC/ODBC endpoint that Hive clients can use to communicate with Hive.

It supports the following:

[Figure: HiveServer2]