STORM-539. Storm hive bolt and trident state.
harshach committed Feb 12, 2015
1 parent 8036109 commit 81772b2
Showing 21 changed files with 3,148 additions and 2 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -31,4 +31,6 @@ target
.*
!/.gitignore
_site
storm-core/dependency-reduced-pom.xml
dependency-reduced-pom.xml
derby.log
metastore_db
111 changes: 111 additions & 0 deletions external/storm-hive/README.md
@@ -0,0 +1,111 @@
# Storm Hive Bolt & Trident State

Hive offers a streaming API that allows data to be written continuously into Hive. The incoming data
can be committed continuously, in small batches of records, into an existing Hive partition or table. Once the data
is committed, it is immediately visible to all Hive queries. More information on the Hive Streaming API is available at
https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest

With the help of the Hive Streaming API, HiveBolt and HiveState allow users to stream data from Storm directly into Hive.
To use the Hive Streaming API, users need to create a bucketed table stored in ORC format. An example is shown below:

```sql
create table test_table ( id INT, name STRING, phone STRING, street STRING)
  partitioned by (city STRING, state STRING)
  clustered by (id) into 5 buckets
  stored as orc tblproperties ("orc.compress"="NONE");
```


## HiveBolt

HiveBolt streams tuples directly into Hive. Tuples are written using Hive transactions.
The partitions that HiveBolt streams to can either be pre-created, or HiveBolt can optionally
create them if they are missing. Fields from tuples are mapped to table columns.
Users should make sure that the tuple field names match the table column names.

```java
DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
.withColumnFields(new Fields(colNames));
HiveOptions hiveOptions = new HiveOptions(metaStoreURI,dbName,tblName,mapper);
HiveBolt hiveBolt = new HiveBolt(hiveOptions);
```
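
The name-based mapping described above can be illustrated without any Storm or Hive classes. The sketch below is hypothetical and is not the storm-hive implementation: `DelimitedRecordSketch` and `toDelimitedRecord` are invented names, and a plain `Map` stands in for a tuple. It only shows why the tuple field names have to line up with the column names: each column is looked up by name, and a missing field is an error.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of storm-hive: maps tuple fields to table
// columns by name and emits one delimited record.
class DelimitedRecordSketch {

    // Looks up every column name in the tuple, in column order, and joins the
    // values with the delimiter. Fails if the tuple lacks a column's field.
    static String toDelimitedRecord(Map<String, Object> tuple,
                                    List<String> columnNames,
                                    String delimiter) {
        StringBuilder record = new StringBuilder();
        for (int i = 0; i < columnNames.size(); i++) {
            String column = columnNames.get(i);
            if (!tuple.containsKey(column)) {
                throw new IllegalArgumentException("Tuple has no field named " + column);
            }
            if (i > 0) {
                record.append(delimiter);
            }
            record.append(tuple.get(column));
        }
        return record.toString();
    }

    public static void main(String[] args) {
        // A map stands in for a tuple; field order in the tuple does not matter.
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("name", "user1");
        tuple.put("id", 1);
        tuple.put("street", "street1");
        tuple.put("phone", "123456");

        List<String> columns = Arrays.asList("id", "name", "phone", "street");
        System.out.println(toDelimitedRecord(tuple, columns, ","));
    }
}
```

The column list plays the role of `colNames` in the HiveBolt example; the comma delimiter is an arbitrary choice for the illustration.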

### RecordHiveMapper
This class maps tuple field names to Hive table column names.
There are two implementations available:


1) DelimitedRecordHiveMapper
2) JsonRecordHiveMapper

```java
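// Option 1: partition by fields taken from the tuple.
DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
    .withColumnFields(new Fields(colNames))
    .withPartitionFields(new Fields(partNames));

// Option 2 (use instead of option 1): partition by system time.
// The lowercase pattern assumes java.text.SimpleDateFormat rules, where
// "YYYY" means week-year and "DD" means day-of-year.
DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
    .withColumnFields(new Fields(colNames))
    .withTimeAsPartitionField("yyyy/MM/dd");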
```

|Arg | Description | Type |
|--- |--- |--- |
|withColumnFields| Field names in a tuple to be mapped to table column names | Fields (required) |
|withPartitionFields| Field names in a tuple to be mapped to Hive table partitions | Fields |
|withTimeAsPartitionField| Uses the system time as the partition in the Hive table | String, date format |
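
If the date format string passed to `withTimeAsPartitionField` is interpreted with `java.text.SimpleDateFormat` rules (an assumption; this README does not say so), the letter case of the pattern matters. The following self-contained sketch uses only the JDK; `TimePartitionSketch` and `partitionValue` are hypothetical names, and the fixed UTC time zone is chosen only to make the output reproducible.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch: formats a timestamp into a partition value the way a
// SimpleDateFormat-based mapper might. Not part of storm-hive.
class TimePartitionSketch {

    // Formats the given instant with the given pattern, in UTC.
    static String partitionValue(String pattern, long epochMillis) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        long feb12 = 1423699200000L; // 2015-02-12T00:00:00Z

        // Calendar year / month / day-of-month.
        System.out.println(partitionValue("yyyy/MM/dd", feb12)); // 2015/02/12

        // Week-year / month / day-of-year: Feb 12 is day 43 of the year.
        System.out.println(partitionValue("YYYY/MM/DD", feb12)); // 2015/02/43
    }
}
```

Under that assumption, `yyyy/MM/dd` yields one partition per calendar day, while an uppercase pattern silently produces day-of-year values.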

### HiveOptions

HiveBolt takes in HiveOptions as a constructor arg.

```java
HiveOptions hiveOptions = new HiveOptions(metaStoreURI,dbName,tblName,mapper)
.withTxnsPerBatch(10)
.withBatchSize(1000)
.withIdleTimeout(10);
```
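
The two sizing options interact. Going by the parameter descriptions in this README (each transaction carries at most `batchSize` events, and all transactions of one batch end up in one file), their product gives an upper bound on the number of events per file. The sketch below is plain arithmetic with hypothetical names (`BatchSizingSketch`, `maxEventsPerTransactionBatch`); it does not call storm-hive.

```java
// Hypothetical sketch: upper bound on events written per transaction batch,
// assuming at most batchSize events per transaction.
class BatchSizingSketch {

    // Uses long arithmetic so large settings cannot overflow an int.
    static long maxEventsPerTransactionBatch(int txnsPerBatch, int batchSize) {
        return (long) txnsPerBatch * batchSize;
    }

    public static void main(String[] args) {
        // Values from the HiveOptions example: 10 transactions of up to 1000 events.
        System.out.println(maxEventsPerTransactionBatch(10, 1000));   // 10000

        // Documented defaults: 100 transactions of up to 15000 events.
        System.out.println(maxEventsPerTransactionBatch(100, 15000)); // 1500000
    }
}
```

Lowering either value makes the files written before compaction smaller; raising either makes them larger.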


HiveOptions params

|Arg |Description | Type |
|--- |--- |--- |
|metaStoreURI | Hive metastore URI (can be found in hive-site.xml) | String (required) |
|dbName | Database name | String (required) |
|tblName | Table name | String (required) |
|mapper| Mapper class that maps tuple field names to table column names | DelimitedRecordHiveMapper or JsonRecordHiveMapper (required) |
|withTxnsPerBatch | Hive grants a *batch of transactions* instead of single transactions to streaming clients like HiveBolt. This setting configures the number of desired transactions per transaction batch. Data from all transactions in a single batch ends up in a single file. HiveBolt will write a maximum of batchSize events in each transaction in the batch. This setting, in conjunction with batchSize, provides control over the size of each file. Note that Hive will eventually and transparently compact these files into larger files.| Integer, default 100 |
|withMaxOpenConnections| Allow only this number of open connections. If this number is exceeded, the least recently used connection is closed.| Integer, default 100 |
|withBatchSize| Maximum number of events written to Hive in a single Hive transaction | Integer, default 15000 |
|withCallTimeout| (In milliseconds) Timeout for Hive and HDFS I/O operations, such as openTxn, write, commit, abort. | Integer, default 10000 |
|withHeartBeatInterval| (In seconds) Interval between consecutive heartbeats sent to Hive to keep unused transactions from expiring. Set this value to 0 to disable heartbeats.| Integer, default 240 |
|withAutoCreatePartitions| HiveBolt will automatically create the necessary Hive partitions to stream to. | Boolean, default true |
|withKerberosPrincipal| Kerberos user principal for accessing secure Hive | String |
|withKerberosKeytab| Kerberos keytab for accessing secure Hive | String |
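
The eviction rule described for `withMaxOpenConnections` (close the least recently used connection once the limit is exceeded) can be sketched with the JDK's access-ordered `LinkedHashMap`. This is an illustration of the policy only, not HiveBolt's actual connection handling; `LruConnectionCacheSketch`, its nested `Connection` class, and the partition keys are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a least-recently-used connection cache.
class LruConnectionCacheSketch {

    // Stand-in for an open writer to one Hive partition.
    static class Connection {
        final String partition;
        boolean closed = false;
        Connection(String partition) { this.partition = partition; }
        void close() { closed = true; }
    }

    // Keys of evicted connections, recorded so the behavior can be observed.
    final List<String> evicted = new ArrayList<>();
    private final LinkedHashMap<String, Connection> open;

    LruConnectionCacheSketch(final int maxOpenConnections) {
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.open = new LinkedHashMap<String, Connection>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Connection> eldest) {
                if (size() > maxOpenConnections) {
                    eldest.getValue().close(); // close before dropping it
                    evicted.add(eldest.getKey());
                    return true;
                }
                return false;
            }
        };
    }

    // Returns the connection for a partition, opening one if needed.
    Connection get(String partition) {
        Connection connection = open.get(partition); // marks it most recently used
        if (connection == null) {
            connection = new Connection(partition);
            open.put(partition, connection); // may evict the eldest entry
        }
        return connection;
    }

    int openCount() {
        return open.size();
    }

    public static void main(String[] args) {
        LruConnectionCacheSketch cache = new LruConnectionCacheSketch(2);
        cache.get("city=a");
        cache.get("city=b");
        cache.get("city=a"); // city=a is now more recently used than city=b
        cache.get("city=c"); // exceeds the limit, so city=b is closed
        System.out.println(cache.evicted); // [city=b]
    }
}
```

With many distinct partitions and a small limit, a policy like this trades repeated connection setup for a bounded number of open connections.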



## HiveState

The Hive Trident state follows a pattern similar to HiveBolt: it takes HiveOptions as an argument.

```java
// The lowercase pattern assumes java.text.SimpleDateFormat rules.
DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
    .withColumnFields(new Fields(colNames))
    .withTimeAsPartitionField("yyyy/MM/dd");
HiveOptions hiveOptions = new HiveOptions(metaStoreURI,dbName,tblName,mapper)
    .withTxnsPerBatch(10)
    .withBatchSize(1000)
    .withIdleTimeout(10);
StateFactory factory = new HiveStateFactory().withOptions(hiveOptions);
TridentState state = stream.partitionPersist(factory, hiveFields, new HiveUpdater(), new Fields());
```










143 changes: 143 additions & 0 deletions external/storm-hive/pom.xml
@@ -0,0 +1,143 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<parent>
<artifactId>storm</artifactId>
<groupId>org.apache.storm</groupId>
<version>0.10.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
</parent>

<packaging>jar</packaging>
<artifactId>storm-hive</artifactId>
<name>storm-hive</name>
<developers>
<developer>
<id>harshach</id>
<name>Sriharsha Chintalapani</name>
<email>mail@harsha.io</email>
</developer>
</developers>

<dependencies>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>${project.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hive.hcatalog</groupId>
<artifactId>hive-hcatalog-streaming</artifactId>
<version>${hive.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>

<dependency>
<groupId>org.apache.hive.hcatalog</groupId>
<artifactId>hive-hcatalog-core</artifactId>
<version>${hive.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-cli</artifactId>
<version>${hive.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.calcite</groupId>
<artifactId>calcite-core</artifactId>
<version>0.9.2-incubating</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>com.googlecode.json-simple</groupId>
<artifactId>json-simple</artifactId>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-all</artifactId>
<version>1.9.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.thrift</groupId>
<artifactId>libthrift</artifactId>
<version>0.9.0</version>
<scope>compile</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>2.2</version>
<executions>
<execution>
<goals>
<goal>test-jar</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>

</project>