Overview

This chapter is compiled from the Hortonworks and Cloudera Hive documentation and from the book Programming Hive.

Hive CLI

⚠️ Deprecation in favor of Beeline CLI. HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline, which is a JDBC client based on SQLLine. Due to new development being focused on HiveServer2, Hive CLI will soon be deprecated in favor of Beeline.

Start Beeline

beeline -u <database URL> -n <username> -p <password>
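
For example, to connect to a HiveServer2 instance on its default port 10000 (the host, username, and password below are placeholders):

beeline -u jdbc:hive2://localhost:10000 -n myuser -p mypassword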
Print Hive Configs

set;

Print Hadoop and Hive Configs

set -v;

Set Configurations

set <key>=<value>;
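
For example, to set the number of reducers for subsequent queries in the session (the value is illustrative):

set mapred.reduce.tasks=32;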

Add Resources to the Hadoop Distributed Cache

add FILE[S] <filepath> <filepath>*;
add JAR[S] <filepath> <filepath>*;
add ARCHIVE[S] <filepath> <filepath>*;
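
For example, to make a UDF jar available to the current session (the path is hypothetical):

add JAR /tmp/udfs/my_udf.jar;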

List Databases

SHOW DATABASES;

List Tables

SHOW TABLES;

Truncate Table

Through spark-shell

sql("TRUNCATE TABLE <db>.<table>");

Create a Table (Example)

CREATE TABLE IF NOT EXISTS celebrity (
    object_key String,
    name String
)
COMMENT 'Celebrity Info'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Describe a Table

DESCRIBE <tablename>;

For more information:

DESCRIBE EXTENDED <tablename>;
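
For the same details in a more readable, tabular layout:

DESCRIBE FORMATTED <tablename>;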

List Functions

SHOW FUNCTIONS;

Describe a Function

DESCRIBE FUNCTION <functionname>;

HiveQL Basics

As the section indicates, we’re just going to look at some more commonly used HiveQL statements. See the Hive DDL Language Manual for the comprehensive HiveQL documentation.

Databases

Creating

Syntax
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
    [COMMENT database_comment]
    [LOCATION hdfs_path]
    [WITH DBPROPERTIES (property_name=property_value, ...)];
Example
CREATE DATABASE IF NOT EXISTS my_db LOCATION '/tmp/my_databases/';

Dropping

Syntax
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Example
DROP DATABASE IF EXISTS my_db;

Altering

Using the ALTER DATABASE command, you can set key-value pairs in the DBPROPERTIES associated with a database. No other metadata about the database can be changed, including its name and directory location.

Syntax
ALTER DATABASE database_name SET DBPROPERTIES (property_name=property_value, ...);
Example
ALTER DATABASE stock_portfolio SET DBPROPERTIES ('advisor' = 'Jason Jones');

Tables

Data Types

The following table lists the supported data types for columns:

| Data Type | Example | Description |
| --- | --- | --- |
| TINYINT | 20 | One-byte signed integer |
| SMALLINT | 327 | Two-byte signed integer |
| INT | 214748 | Four-byte signed integer |
| BIGINT | 9223372036 | Eight-byte signed integer |
| FLOAT | 3.14 | Single-precision floating point |
| STRING | 'Strings are great' | A sequence of characters. The character set can be specified. Single or double quotes can be used |
| BOOLEAN | TRUE | The boolean value of either true or false |
| TIMESTAMP | 1327882394 (Unix epoch seconds), 1327882394.123456789 (Unix epoch seconds plus nanoseconds), or '2012-02-03 12:34:56.123456789' (JDBC-compliant java.sql.Timestamp format) | Integer, float, or string |
| BINARY | | An array of bytes |
| ARRAY | ['pineapples', 'mangos'] | An ordered collection of elements of the same type, indexed by zero-based integers (e.g., fruits[0]) |
| MAP | map('colors', 7, 'patterns', 10) | A collection of key-value pairs, e.g. MAP<STRING, INT>; values are accessed using array notation with the key (e.g., m['colors']) |
| STRUCT | {'fname':'Tom', 'nickname':'Haymaker', 'lname':'Jiggins'} | Analogous to a C struct or an 'object'. Fields are accessed using 'dot' notation. For example, if a column name is of type STRUCT {first STRING; last STRING}, the first name field can be referenced as name.first |
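
As a sketch of how the complex types are declared and accessed (the table and column names here are hypothetical):

CREATE TABLE employees (
    name         STRING,
    subordinates ARRAY<STRING>,
    deductions   MAP<STRING, FLOAT>,
    address      STRUCT<street:STRING, city:STRING, zip:INT>
);

SELECT
    subordinates[0],      -- array element by zero-based index
    deductions['taxes'],  -- map value by key
    address.city          -- struct field by dot notation
FROM employees;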

Viewing/Describing

To see all of the tables in a database:

SHOW TABLES;

You can also view the schema of a particular table:

DESCRIBE yourtable;

Creating

To create a table, you define a schema with the CREATE TABLE statement, which can specify how rows and fields are delimited, the file format used to store the data, and the location of that data.

Syntax
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...]) [STORED AS DIRECTORIES] (Note: Only available starting with Hive 0.10.0)]
  [
   [ROW FORMAT row_format] [STORED AS file_format]
   | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  (Note: Only available starting with Hive 0.6.0)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]  (Note: Only available starting with Hive 0.6.0)
  [AS select_statement]  
Example

In this statement, you are defining the schema for the table search and storing the data as an ORC file.

CREATE TABLE search (
    bcookie string, 
    time_stamp int, 
    yuid string, 
    ip string, 
    pg_spaceid string, 
...)
STORED AS orc;

When you drop a typical (managed) table, the raw data is lost because the directory corresponding to the table in the warehouse is deleted. Thus, to retain the raw data for others to run queries, you should create EXTERNAL tables.

To create an external table, explicitly use the keyword EXTERNAL and point to the location of the data when creating the table. This ensures that the data is not moved into a location inside the warehouse directory.

CREATE EXTERNAL TABLE search (
    bcookie string, 
    time_stamp int, 
    yuid string, 
    ip string, 
    pg_spaceid string, 
...)
STORED AS orc
location '/some-path';

Creating/Adding Partitions

In addition to creating columns for a table, you can create partitions, which allow you to efficiently find rows based on certain criteria.

Syntax

Creating a table with partitions:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
    [(col_name data_type [COMMENT col_comment], ...)]
   [COMMENT table_comment]
   [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]

Adding a partition to an existing table:

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...
Example

For example, you could partition the table on the criteria of locale and datestamp to fetch bcookies from the “US” locale after the date “2014-01-01”.

CREATE EXTERNAL TABLE search (
    bcookie string, 
    time_stamp int, 
    yuid string, 
    ip string, 
    pg_spaceid string, 
...)
PARTITIONED BY (
    locale string, 
    datestamp string)
STORED AS orc
location '...';

You can also add partitions to existing tables with the ALTER keyword:

ALTER TABLE search ADD PARTITION (locale='US', datestamp='20130201')
LOCATION '...';
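
With those partitions in place, queries that filter on the partition columns read only the matching partition directories; a sketch, assuming datestamp values use the YYYYMMDD form shown above:

SELECT bcookie
FROM search
WHERE locale = 'US'
  AND datestamp >= '20140101';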

Reading Data

Reading data uses the SELECT keyword and can include filters and functions. Because a SELECT statement can be complex, we’re just going to look at some examples rather than give the full syntax.

Selecting All Values

SELECT * FROM your_table;

Selecting Columns

SELECT column_a, column_b FROM your_table;

Filtering Data

WHERE

The results from the query below are filtered based on the amount and region:

SELECT * FROM sales WHERE amount > 10 AND region = "US"

HAVING

SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10

LIMIT

LIMIT indicates the number of rows to be returned. Without an ORDER BY, the rows returned are arbitrary, so the result is not deterministic. The following query returns 5 arbitrary rows from t1.

SELECT * FROM t1 LIMIT 5

Grouping/Ordering/Sorting

GROUP BY
SELECT
    a,
    sum(b)
FROM
    t1
GROUP BY
    a;

ORDER BY

When using ORDER BY, data is passed through one reducer to create an ordered result set from all of the data. This may require an exceptionally long time to execute for larger data sets.

SELECT s.symbol, s.price_open, s.price_close
FROM stocks s
ORDER BY s.symbol ASC, s.price_open DESC;

SORT BY

SORT BY orders data within each reducer, performing a local ordering: each reducer’s output is sorted. Performance is better than with ORDER BY, but you get a sorted result per reducer rather than a total ordering of all the data.

SELECT s.symbol, s.price_open, s.price_close
FROM stocks s
SORT BY s.symbol ASC, s.price_open DESC;

Writing Data

Load Files Into Tables

Load operations copy or move data files into locations corresponding to Hive tables: files from the local filesystem (LOCAL) are copied, while files already in HDFS are moved. The file format must agree with the table format.

Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename 
[PARTITION (partcol1=val1, partcol2=val2 ...)];
Example
LOAD DATA LOCAL INPATH './data/nyse.txt' OVERWRITE INTO TABLE stocks;

Insert Data From a Query Into Tables

Query results can be inserted into tables or file system directories by using the INSERT clause.

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
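
A sketch of a static-partition insert, where the partition value is fixed in the statement (the summary table here is hypothetical):

INSERT OVERWRITE TABLE search_summary PARTITION (datestamp='20140101')
SELECT locale, count(*)
FROM search
WHERE datestamp = '20140101'
GROUP BY locale;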

Hive also supports multiple inserts in the same statement, as well as dynamic partition inserts, which are enabled and controlled with the following settings:

set hive.exec.dynamic.partition = true | false;
set hive.exec.dynamic.partition.mode = strict | nonstrict; 
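
With dynamic partitioning enabled, the partition values come from the query itself, and the partition columns must appear last in the SELECT list; a sketch, assuming a hypothetical staging table with matching columns:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE search PARTITION (locale, datestamp)
SELECT bcookie, time_stamp, yuid, ip, pg_spaceid, locale, datestamp
FROM search_staging;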

Alter Table Add Partitions

You can use ALTER TABLE ADD PARTITION to add partitions to a table. The location must be a directory inside of which data files reside.

ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';

If new partitions are added directly in HDFS, Hive will not be aware of them. Thus, you will need to add the metadata about the partitions from HDFS into the Hive metastore with the following:

MSCK REPAIR TABLE table_name;

Integrating Hive with Oozie

In Oozie, a hive action runs a Hive job. The workflow job will wait until the Hive job completes before continuing to the next action. To run the Hive job, you configure the hive action with the job-tracker, name-node, and Hive script elements, as well as the necessary parameters and configuration.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="myfirsthivejob">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <prepare>
                <delete path="${jobOutput}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <script>myscript.q</script> 
            <param>InputDir=/home/tucu/input-data</param>
            <param>OutputDir=${jobOutput}</param>
        </hive>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    ...
</workflow-app>
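
The referenced myscript.q receives InputDir and OutputDir through Oozie’s parameter substitution; a minimal sketch of what such a script might contain (the table name is hypothetical):

-- ${InputDir} and ${OutputDir} are substituted by Oozie before execution
CREATE EXTERNAL TABLE IF NOT EXISTS raw_input (line STRING)
LOCATION '${InputDir}';

INSERT OVERWRITE DIRECTORY '${OutputDir}'
SELECT * FROM raw_input;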

Integrating Hive With HBase

Hive defines the HBaseStorageHandler class to enable integration with HBase. This class has hooks for input/output formats and for metadata operations like CREATE, DROP, etc.

Hive tables have columns and column types; the equivalents in HBase tables are column families and column qualifiers. Hive, however, does not need to include all of the columns in an HBase table.

The following schema definition stores the data with the HBaseStorageHandler, maps the HBase columns with the keyword SERDEPROPERTIES, and points to the HBase table with the keyword TBLPROPERTIES.

CREATE TABLE short_urls(
    short_url string,
    url string,
    hit_count int
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,u:url,s:hits")
TBLPROPERTIES ("hbase.table.name" = "short_urls");