What is Hive?
Apache Hive is an application that abstracts Hadoop data so that it can be accessed using an SQL-like language called HiveQL. Using HiveQL's familiar declarative syntax, you can query large amounts of data on the grid as if you were working with a relational database.
Hive offers a broad range of SQL semantics and integrates with both ODBC and JDBC interfaces, making it ideal for analyzing data.
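For example, an aggregation over grid data reads like ordinary SQL. A minimal sketch, assuming a hypothetical `page_views` table already registered with Hive:

```
-- Count page views per country for one day; Hive compiles this
-- into jobs that run across the cluster. Table and column names
-- are hypothetical.
SELECT country, COUNT(*) AS views
FROM page_views
WHERE view_date = '2014-09-01'
GROUP BY country
ORDER BY views DESC
LIMIT 10;
```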
Why Hive?
Hive is one of the fastest-growing products in the Hadoop ecosystem for many reasons:
- Battle-Tested Standard - Hive is the de facto SQL standard for Hadoop and has been around for many years.
- Single Solution - Hive works across a broad spectrum of data volumes and allows you to load, store, read, analyze, and combine data.
- Interoperability - HCatalog is a metastore that is part of the Hive project. It allows Hive to work with Pig and HBase. Hive is also one of the few SQL solutions on Hadoop that has been widely certified by business intelligence (BI) vendors (10+ major vendors and growing).
- Strong Support - Hive is a top-level project (TLP) with the Apache Software Foundation (ASF). The Hive community is comprised of top-notch engineers and architects from well-known IT companies.
- Improving Performance - With each release, Hive’s performance improves and closes the gap with Pig. This performance increase comes both from changes in Hive itself and from the underlying execution engine: Hive 0.13 and 0.14 can run on Tez instead of MapReduce. There is even more reason to be optimistic about performance thanks to the Stinger initiative, a broad, community-based effort to improve future versions of Hive.
Accessing Hive
The diagram shows how a query made from the Hive CLI is transmitted to Hive, where it is translated into a MapReduce job that Hadoop runs. Client applications use the ODBC/JDBC drivers to relay queries to HiveServer2; like queries from the CLI, these are converted into MapReduce jobs and executed on Hadoop.
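You can inspect this translation yourself with HiveQL's EXPLAIN statement, which prints the stages Hive generates for a query instead of running it. A minimal sketch, again assuming a hypothetical `page_views` table:

```
-- Print the execution plan Hive builds for a query: a graph of
-- MapReduce (or Tez) stages, rather than the query results.
EXPLAIN
SELECT country, COUNT(*)
FROM page_views
GROUP BY country;
```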
Hive vs. Pig
The table below shows the difference between Hive and Pig, highlighting when and where each should be used, respective features, and available support.
| | Hive | Pig |
|---|---|---|
| Where to Use | Ad-hoc analytics and reporting | ETL and pipeline data processing |
| Language | SQL (declarative) | Pig Latin (procedural) |
| Schema/Types | Mandatory (implicit) | Optional (explicit) |
| Partitions | Yes | No; partition pruning via HCatalog |
| Complex Processing | Not a good fit for complex processing | Well suited; multi-query processing works with thousands of lines of Pig script |
| Client/Server | Requires a metastore server (HCatalog) and data registered with it | Client only; works with the HCatalog metastore |
| ODBC/JDBC | Yes, through HiveServer2 | No |
| Tez Support | Present and stable from Hive 0.13 | Under development (Pig 0.14) |
| ORC/Vectorization | ORC and vectorization available | ORC available with Pig 0.14; no vectorization yet |
| Transactions | Yes (coming soon) | No |
| Cost-Based Optimization | Yes (coming soon) | No |
When to Use Hive vs. HBase
While it is reasonable to compare Pig and Hive, HBase and Hive serve very different purposes in the Hadoop ecosystem. The table below highlights the differences and when you would consider using each.
| | Hive | HBase |
|---|---|---|
| Where to Use | Data warehousing and analytics on top of Hadoop/HDFS.<br>Query and analyze large volumes of data.<br>Not a fit for frequent and/or record-level updates (although support for ACID transactions is being added). | Distributed key-value store for persistence and random access on HDFS.<br>Built to support tens of thousands of record-level reads/writes per second.<br>Store and access values using keys. |
| Access | Primarily through Hive SQL | Java and REST APIs |
| SQL | Getting close to SQL standards | Not supported natively; available through Hive or Phoenix (a SQL skin on HBase) |
| Integration | Integrated with Pig through HCatalog.<br>Integrated with Oozie through support for the Hive action and HCatalog partition notifications. | Integrated with Hive for SQL support.<br>Integrated with Pig (HBaseStorage) and Oozie (credential support). |
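As a concrete example of the Hive-side integration, a Hive table can be overlaid on existing HBase data through Hive's HBase storage handler, after which it is queryable with plain HiveQL. A sketch, assuming a hypothetical HBase table named `user_profiles` with a column family `info`:

```
-- Map a Hive table onto an existing HBase table so it can be
-- queried with HiveQL. Table, column family, and column names
-- are hypothetical.
CREATE EXTERNAL TABLE user_profiles (
  user_id STRING,
  name    STRING,
  email   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,info:name,info:email'
)
TBLPROPERTIES ('hbase.table.name' = 'user_profiles');
```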
Data Model
Hive data is organized into databases, tables, partitions, and buckets. Those familiar with SQL will recognize databases, which use a namespace to organize a group of tables, and tables, which have a schema defining their columns. Partitions let you create virtual columns based on keys that determine how data is stored; users can target queries at specific partitions instead of running them across an entire data set. Buckets split partitions further, allowing even more focused queries. Skewed tables, like partitions, let you focus queries on a subset of the data set by splitting the data into separate files so that certain files can be skipped when executing a query.
The diagram below gives the general hierarchy of the data model and a general characteristic of each level. See Data Units for more detailed information.
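The sketch below shows how partitions, buckets, and skew are declared in HiveQL; all table names, column names, and skew values are hypothetical:

```
-- Partitioned, bucketed table: each (view_date, country) pair maps
-- to its own directory, and rows within a partition are hashed by
-- user_id into 32 bucket files.
CREATE TABLE page_views (
  user_id STRING,
  url     STRING,
  ip      STRING
)
PARTITIONED BY (view_date STRING, country STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Skewed table: rows with a heavily repeated url value are split
-- into separate files (here, directories) that queries can skip.
CREATE TABLE weblog (
  user_id STRING,
  url     STRING
)
SKEWED BY (url) ON ('http://example.com/home')
STORED AS DIRECTORIES;
```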
Hive and HCatalog
HCatalog, part of the Hive project, is the central metastore that facilitates interoperability among various Hadoop tools. It not only acts as the table and storage management layer so that Pig, MapReduce, and Hive can share data, but also presents a relational view of the data in HDFS, abstracts where and in what format the data is stored, and enables notifications of data availability.
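Because the metastore tracks each table's location and storage format, clients never need to hard-code HDFS paths. For example, the metadata maintained for a table can be inspected from Hive (table name hypothetical):

```
-- Show metastore metadata for a table: HDFS location, input/output
-- formats, SerDe, partition keys, and table parameters.
DESCRIBE FORMATTED page_views;
```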
HiveServer2
HiveServer2 is the JDBC/ODBC endpoint that Hive clients can use to communicate with Hive.
It supports the following:
- concurrent clients
- secure clusters and encryption
- user/global/session configuration
- doAs support, allowing Hive queries to run as the requesting user
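As an example of session-level configuration, each client connected through HiveServer2 can adjust settings with SET statements without affecting other concurrent sessions. A sketch (the queue name is hypothetical, and the engine setting assumes Tez is installed):

```
-- Session-scoped settings: visible only to this connection,
-- not to other concurrent HiveServer2 clients.
SET hive.execution.engine=tez;
SET mapreduce.job.queuename=analytics;

-- Print the effective value of a property for this session.
SET hive.execution.engine;
```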