Supported File Formats

Hive supports several file formats, including TEXTFILE, SEQUENCEFILE, RCFILE, and (as of Hive 0.11) ORC.

Hive also allows mixed formats: the file format is recorded per partition, so different partitions of the same table can be stored in different formats, as sketched below.
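For example, older partitions of a table can stay in text format while newer ones are written as ORC. A minimal sketch (the table and partition values are hypothetical):

CREATE TABLE logs (msg STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

ALTER TABLE logs ADD PARTITION (dt = '2014-01-01');

-- Only this partition switches to ORC; other partitions keep their format
ALTER TABLE logs PARTITION (dt = '2014-01-01') SET FILEFORMAT ORC;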

Supported Compression Formats

The compression codecs available to Hive are those listed in the Hadoop property io.compression.codecs. In a typical Hadoop installation this includes org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, and org.apache.hadoop.io.compress.BZip2Codec.

Note: The compression codec Snappy (org.apache.hadoop.io.compress.SnappyCodec) is currently not supported.

Setting Compression

To enable intermediate compression (the output written between MapReduce stages):

SET hive.exec.compress.intermediate = true;
SET mapreduce.map.output.compress.codec = <...>;

For output compression:

SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress.codec = <...>;
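As a concrete illustration, the settings below use Gzip for both intermediate and final output (a sketch; any codec listed in io.compression.codecs will work, and per the note above Snappy is not an option here):

SET hive.exec.compress.intermediate = true;
SET mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.GzipCodec;

SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.GzipCodec;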

Text Files

You can load data from, and store data into, delimited text files.

When you create a table, you specify the field delimiter and declare that the table is stored as a text file. The example below creates a table for comma-delimited data:

CREATE TABLE u_data (
    id INT,
    fname STRING,
    lname STRING,
    gender STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Given the comma-delimited data in the CSV file actors_input.csv below:

1, Kevin, Bacon, male
2, Billy, Crystal, male
3, Sandra, Bullock, female
4, Demi, Moore, female

You could load into the table u_data with the following:

LOAD DATA LOCAL INPATH './actors_input.csv'
OVERWRITE INTO TABLE u_data; 
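After loading, a simple query confirms the contents:

SELECT * FROM u_data;

Note that the space after each comma in actors_input.csv becomes part of the loaded value (for example, ' Kevin' rather than 'Kevin'); trim the source file, or apply trim() in queries, if that matters.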

For text files that are not delimited, you can use regular expressions with SERDEPROPERTIES and a RegexSerDe class; the example below uses org.apache.hadoop.hive.contrib.serde2.RegexSerDe (newer Hive releases also ship a built-in org.apache.hadoop.hive.serde2.RegexSerDe). A typical use case for regular expressions is loading data from a Web log.

The example below parses information from an Apache Web log so that the text file can be loaded into a Hive table:

CREATE TABLE serde_regex(
    host STRING,
    identity STRING,
    user STRING,
    time STRING,
    request STRING,
    status STRING,
    size STRING,
    referer STRING,
    agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
    "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
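To use the table, point a load at an Apache access log (the file name here is hypothetical) and query it like any other table:

LOAD DATA LOCAL INPATH './access_log' INTO TABLE serde_regex;

SELECT host, status, request
FROM serde_regex
LIMIT 10;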

Optimized Row Columnar (ORC) File

What is ORC?

ORC is a file format, first introduced in Hive 0.11, designed specifically to store Hive data efficiently. Using ORC files improves Hive's performance when reading, writing, and processing data.

File Structure

ORC files are structured as groups of row data and auxiliary information, known collectively as stripes. Each stripe contains index data, row data, and a stripe footer. The default stripe size is 250 MB. Large stripes allow Hive to read data from HDFS in large, efficient chunks.

At the end of the ORC file is a postscript holding compression parameters and the size of the compressed footer.

The basic structure is a sequence of stripes followed by a file footer and postscript:

[Diagram: ORC file layout, showing stripes (index data, row data, stripe footer), the file footer, and the postscript.]

The index data holds the minimum and maximum values for each column, plus the row positions within the stripe. The row data contains the actual column values used in table scans. The file footer lists the stripes in the file, the number of rows per stripe, and column-level statistics.

How to Use ORC

To set the default file format to ORC, use the SET command:

SET hive.default.fileformat = Orc;

-- The ORC writer is allowed up to 50% of the JVM heap size by default
SET hive.exec.orc.memory.pool = 0.50;

A table STORED AS ORC is equivalent to spelling out the ORC SerDe and input/output formats explicitly:

ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
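In a full statement (the table and column names are just for illustration), that expansion looks like:

CREATE TABLE orc_explicit (
    id   INT,
    name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';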

You can also specify that Hive store data as an ORC file when creating a table:

CREATE TABLE addresses (
    name   STRING,
    street STRING,
    city   STRING,
    state  STRING,
    zip    INT
)
STORED AS orc
LOCATION '/users/sumeetsi/orcfile'
TBLPROPERTIES ("orc.compress" = "ZLIB");

Or, alter a table so that it uses ORCFile:

ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC;
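Note that SET FILEFORMAT only changes the table metadata; existing files are not rewritten. A common way to actually convert data (a sketch, reusing the u_data table from the text-file example) is to create an ORC table and copy into it:

CREATE TABLE u_data_orc (
    id     INT,
    fname  STRING,
    lname  STRING,
    gender STRING)
STORED AS orc;

INSERT OVERWRITE TABLE u_data_orc
SELECT id, fname, lname, gender FROM u_data;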

ORC File Configurations

There are a number of configurations that you can set in TBLPROPERTIES. The list below gives each key, its default, and a short description.

orc.compress (default: ZLIB)
    High-level compression; one of NONE, ZLIB, or SNAPPY (as noted above, Snappy is currently unsupported here and needs evaluation).

orc.compress.size (default: 262144, i.e. 256 KB)
    Number of bytes in each compression chunk.

orc.stripe.size (default: 67108864, i.e. 64 MB)
    Number of bytes in each stripe. Each ORC stripe is processed in one map task; try 32 MB to cut down on disk I/O.

orc.row.index.stride (default: 10000)
    Number of rows between index entries (must be >= 1000). A larger stride makes it less likely that a stride can be skipped for a given predicate.

orc.create.index (default: true)
    Whether to create row indexes, which enable predicate push-down. If data is frequently filtered on a certain column, sorting on that column before writing makes the index filters more effective.

To set an ORC file configuration, use TBLPROPERTIES when creating or altering a table, as shown below, which turns off ORC compression by setting orc.compress to NONE (the accepted values are NONE, ZLIB, and SNAPPY):

CREATE TABLE addresses (
    name   STRING,
    street STRING,
    city   STRING,
    state  STRING,
    zip    INT
)
STORED AS orc
TBLPROPERTIES ("orc.compress" = "NONE");
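Multiple keys can be combined in a single TBLPROPERTIES clause. A sketch (the table name is hypothetical; the 32 MB stripe size follows the suggestion in the list above):

CREATE TABLE addresses_tuned (
    name   STRING,
    street STRING,
    city   STRING,
    state  STRING,
    zip    INT
)
STORED AS orc
TBLPROPERTIES (
    "orc.compress"         = "ZLIB",
    "orc.stripe.size"      = "33554432",
    "orc.row.index.stride" = "10000",
    "orc.create.index"     = "true"
);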