This project hosts several tools to help with development using Hive, Spark, and Hadoop.
To study how to leverage new features of Hive, Spark, and Hadoop for data processing, I created and am developing a set of functions for Hive and Spark.
- Spark 2.2 (recompiled without Hive support to avoid version conflicts between Spark and Hive)
- Hive 2.3
- Hadoop 3.0.0
Spark Runner is inspired by spark-shell. It uses the same Scala interpreter, with a SparkSession bound in the interpreter context, so it can:
- interpret Scala code from a file on HDFS (to be extended to load Scala/Java/Python from more sources/storage)
- compile the code into a Scala function that takes the current time, the SparkSession, and arbitrary parameters such as an RDD
- run and process the data according to the timing
It runs periodically at a one-minute interval and detects changes to the Scala source; a change in the code triggers the Scala interpreter to recompile and swap in the updated function.
Basically, it is a timer that runs Spark jobs without compiling, stopping, deploying, or restarting the Spark process.
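For illustration, here is a minimal sketch of that reload loop. The HDFS path, class name, and scheduling details are assumptions for the sake of the example, not the actual Spark Runner implementation:

```scala
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.IMain

object SparkRunnerSketch {
  def main(args: Array[String]): Unit = {
    val spark  = SparkSession.builder().appName("spark-runner-sketch").getOrCreate()
    val fs     = FileSystem.get(new Configuration())
    val source = new Path("/user/tncdata/scripts/scala/ProcessorLoader.scala") // hypothetical location

    val settings = new Settings
    settings.usejavacp.value = true
    val interpreter = new IMain(settings)
    // Bind the SparkSession into the interpreter context, the same trick spark-shell uses.
    interpreter.bind("spark", "org.apache.spark.sql.SparkSession", spark)

    var lastModified = 0L
    var processor: (SparkSession, Any) => Any = (_, _) => ()

    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable {
      override def run(): Unit = try {
        val status = fs.getFileStatus(source)
        if (status.getModificationTime > lastModified) {
          // Source changed: re-interpret it; the script's last expression yields the new function.
          lastModified = status.getModificationTime
          val code = scala.io.Source.fromInputStream(fs.open(source)).mkString
          interpreter.interpret(code)
          processor = interpreter.valueOfTerm(interpreter.mostRecentVar)
            .get.asInstanceOf[(SparkSession, Any) => Any]
        }
        // Run whichever version of the function is current on every tick.
        processor(spark, System.currentTimeMillis())
      } catch {
        case e: Exception => e.printStackTrace() // keep the timer alive on failure
      }
    }, 0, 1, TimeUnit.MINUTES)
  }
}
```

The interpreter re-evaluates the source only when the file's modification time changes, so an edit to the script on HDFS is picked up on the next tick without restarting the Spark application.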
Added support to run Hive SQL script files stored on HDFS.
I reuse the Hive Beeline instance and replace its input/output streams so that Spark Runner can mimic Beeline to run Hive SQL scripts (see the sketch after this list). Combined with my Hive UDFs, and in addition to Hive's ability to load data from HDFS files, it is also capable of:
- loading data from HTTP
- sending data to HTTP
- pulling data from Kafka by time window (requires kafka_2.11-0.11.0.0)
- pushing data to Kafka
- reading/setting keys/values in Redis
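Here is a minimal sketch of the stream-swapping idea, assuming an illustrative in-memory script and placeholder connection string and credentials; the project's HiveBeeLine wrapper adds more (loading the script from HDFS, running on an interval), but the core trick is the same:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, PrintStream}

import org.apache.hive.beeline.BeeLine

object BeelineSketch {
  def main(args: Array[String]): Unit = {
    // Script fed to Beeline as if it were typed interactively (URL/credentials are placeholders).
    val script =
      """!connect jdbc:hive2://localhost:10000 hive hive
        |select count(*) from p_city;
        |!quit
        |""".stripMargin

    val captured = new ByteArrayOutputStream()
    val beeLine  = new BeeLine()
    beeLine.setOutputStream(new PrintStream(captured)) // capture Beeline's stdout
    beeLine.setErrorStream(new PrintStream(captured))  // capture Beeline's stderr

    // begin() reads commands from the supplied InputStream instead of the console.
    val status = beeLine.begin(Array("--silent=true"), new ByteArrayInputStream(script.getBytes("UTF-8")))
    println(s"exit status: $status")
    println(captured.toString("UTF-8"))
  }
}
```

Because Beeline reads its commands from the supplied InputStream and writes to the supplied PrintStreams, the runner can feed it a script fetched from HDFS and capture the output programmatically.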
To do: add support for Java/Python source files.
Added a simple annotation to manage task execution, such as the task description and the interval at which to run:
import org.apache.log4j.{LogManager, Logger}
import org.apache.spark.sql.SparkSession
import annotation.ProcDescription
import hive._

/**
  * Created by fan on 2018/04/10.
  */
@ProcDescription(description = "One ring to rule them all", interval = "PT1M")
class ProcessorLoader extends ((SparkSession, Any) => Any) {

  val log: Logger = LogManager.getLogger("ProcessorLoader.scala")

  override def apply(spark: SparkSession, param: Any): Any = {
    log.info("testing....")
    // Run the HiveQL script stored on HDFS through the embedded Beeline, at a 60-minute interval.
    new HiveBeeLine(interval = "PT60M")(spark, "/user/tncdata/scripts/hive/test.sql")
  }
}

new ProcessorLoader
Regarding Hive UDFs (User Defined Functions): I hope to develop more functions that enable Hive to connect to and interact with many other data sources (as long as they expose data through some URI), so that this can become an option for ETL and a different approach to applying machine learning algorithms to data in Hive.
Hive has 3 different types of UDF:
- GenericUDF: executes on each row selected by the query, applies arbitrary business logic, and returns the transformed data. Its initialize() method is invoked before the first row is passed to evaluate(), but it is never told when the last row has been processed. That makes it unsuitable for cases that hold a client to an external data source such as Kafka, because it cannot know when to close the client, and creating and closing a Kafka client for every row would be expensive.
- GenericUDTF: its output is treated as a complete table, so in a SELECT query the UDTF expression must be the only expression and cannot be nested inside other functions. However, its close() method is called after the last row, so I developed t_zk_write (a UDTF that writes to ZooKeeper) as a UDTF: it creates a single ZooKeeper client and uses it to write all rows; the Kafka UDTFs work the same way (see the skeleton after this list).
- GenericUDAFResolver2: (I am still studying this)
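To make the difference concrete, here is a skeleton of that open-once/close-once pattern. SomeClient and the class name TExampleSink are hypothetical stand-ins, not the actual t_zk_write code; the real UDTFs follow the same shape with a ZooKeeper or Kafka client.

```scala
import java.util.Arrays

import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

// Hypothetical client standing in for a ZooKeeper/Kafka/HTTP client.
class SomeClient {
  def send(row: String): String = row
  def close(): Unit = ()
}

class TExampleSink extends GenericUDTF {
  private var client: SomeClient = _

  // Called once before the first row: open the shared client and declare the output schema.
  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector = {
    if (argOIs.length != 1)
      throw new UDFArgumentException("t_example_sink(value) expects exactly one argument")
    client = new SomeClient()
    ObjectInspectorFactory.getStandardStructObjectInspector(
      Arrays.asList("v"),
      Arrays.asList(PrimitiveObjectInspectorFactory.javaStringObjectInspector: ObjectInspector))
  }

  // Called once per row: reuse the shared client instead of opening a new one.
  override def process(args: Array[AnyRef]): Unit = {
    val result = client.send(String.valueOf(args(0)))
    forward(Array[AnyRef](result))
  }

  // Called after the last row: the safe place to release the client.
  override def close(): Unit = client.close()
}
```

Registered with CREATE TEMPORARY FUNCTION like any other function, the client is opened once per task, used for every row, and closed after the last row has been forwarded.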
- url is always string
- timeout is always int, defaulting to 3000
- headers is a map<string, string>
- content is string
type | name | parameters | return |
---|---|---|---|
udf | http_get | FUNC(url, timeout, headers) - send a GET request to url with headers, within timeout | http code, headers, content |
udf | http_post | FUNC(url, timeout, headers, content) - send a POST request with content to url with headers, within timeout | http code, headers, content |
udf | url_encode | FUNC(string) - URL-encode the given string | encoded string |
udtf | t_http_get | FUNC(context, url, timeout, headers) - send a GET request to url with headers, within timeout | http code, headers, content, ctx |
udtf | t_http_post | FUNC(context, url, timeout, headers, content) - send a POST request with content to url with headers, within timeout | http code, headers, content, ctx |
This call of the UDF http_get has to open and close an HTTP client instance for each of the 10 rows:
select city_name, http_get(printf("http://solr:8984/solr/user/select?fl=gender,age,nickname&q=city:'%s'&wt=json", city_name)) from p_city limit 10;
This call of the UDTF t_http_get opens only one HTTP client instance, shares it across all rows, and then closes it:
select t_http_get(city_name, printf("http://solr:8984/solr/user/select?fl=gender,age,nickname&q=city:%s&wt=json", url_encode(city_name))) from p_city limit 30;
city_name is a context column used to mark the result of each row.
This one crawls data from GitHub:
select http_get(printf("https://api.github.com/orgs/%s", "apache"), 10000);
The result looks like:
_c0 |
---|
{"code":200,"headers":{"Transfer-Encoding":"chunked","Status":"200 OK","Server":"GitHub.com"...},"content":"{"login":"apache","id":47359...}"} |
This one uses the UDTF:
select t_http_get("apache", printf("https://api.github.com/orgs/%s", "apache"), 10000);
gives output like:
code | headers | content | ctx |
---|---|---|---|
200 | {"Transfer-Encoding":"chunked","Status":"200 OK","Server":"GitHub.com","Access-Control-Allow-Origin":"*"...} | {"login":"apache","id":47359,...} | apache |
- zkAddress is string like host1:port,host2:port...
- timeout is int
- pathToRead is string
type | name | parameters | return |
---|---|---|---|
udf | zk_read | zk_read(String zkAddress, int timeout, String...pathToReads...) recursively read data from zookeeper paths | struct of "p", "v", path and value |
udf | zk_write | zk_write(String zkAddress, int timeout, Map pathAndValues) recursively write values paired with zookeeper paths | struct of "p", "v", path and old value |
udf | zk_delete | zk_delete(String zkAddress, int timeout, String...pathToDeletes) recursively delete zookeeper paths | struct of "p", "v", path and old value |
udtf | t_zk_write | t_zk_write(String zkAddress, int timeout, String...pathAndValues) recursively write values paired with zookeeper paths | struct of "p", "v", path and old value |
udtf | t_zk_delete | t_zk_delete(Any context, String zkAddress, int timeout, Map<String, String> pathAndValues) recursively delete zookeeper paths | struct of "p", "v", path and old value |
For example, read from the path /hbase:
select explode(zk_read("some-server:2181,some-server:2181,some-server:2181",3000,"/hbase"));
and get a result like:
col |
---|
{"p":"/hbase","v":""} |
{"p":"/hbase/table","v":""} |
{"p":"/hbase/table/test","v":"?\u0000\u0000\u0000\u0014master:16000\u0000/<?!??\u0000PBUF\b\u0000"} |
{"p":"/hbase/table/hbase:namespace","v":"?\u0000\u0000\u0000\u0014master:16000?\u001dA???T?PBUF\b\u0000"} |
{"p":"/hbase/table/hbase:meta","v":"?\u0000\u0000\u0000\u0014master:16000???\u001e?lP?PBUF\b\u0000"} |
... |
type | name | parameters | return |
---|---|---|---|
udf | m_add | m_add(map1, map2....) - Returns a combined map of N maps | a map |
udf | kf_consumer_cfgs | kf_consumer_cfgs() - return the map containing default settings for kafka consumer | a map |
udf | kf_producer_cfgs | kf_producer_cfgs() - return the map containing default settings for kafka producer | a map |
udf | kf_topics | kf_topics(map) - read all topics from kafka using the settings in the map parameter (e.g. kf_consumer_cfgs()) | an array of structs with properties of t(topic) and p(partition) |
udf | kf_pull | kf_pull(map, start_date_str, end_date_str, topics...) - poll records from kafka by time window and topics | an array of structs that has "t" for topic, "ct" for creation time, "k" for key, "v" for value |
udtf | t_kf_push | t_kf_push(Any context, Map<String, String> map, String topic, String key, String value) - push records to the kafka topic within a transaction | structs that have "t" for topic, "ct" for creation time, "k" for key, "v" for value |
I am still working on the functions to pull from and push to Kafka.
The following script shows how these UDFs can be combined into an ETL job: it uses t_http_get to pull message events from the SparkPost API and loads the results into Hive tables.
use mns_tests;
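-- Helper macros: TP() is the timestamp pattern used throughout,
-- default_now() falls back to the current time when the argument is null,
-- and su_strip() strips the given characters via commons-lang3 StringUtils.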
CREATE TEMPORARY MACRO TP() "yyyy-MM-dd'T'HH:mm";
CREATE TEMPORARY MACRO default_now(time_str string) if(time_str IS NOT NULL, unix_timestamp(time_str, TP()), unix_timestamp());
create temporary macro su_strip(s string, t string) reflect('org.apache.commons.lang3.StringUtils', 'strip', s, t);
-- using t_http_get to load data from spark-post web service
-- the result is in form of json
drop table if exists temp_raw_events;
CREATE temporary TABLE temp_raw_events
as select t_http_get(
named_struct(
't', unix_timestamp(),
's', from_unixtime(default_now(query_end), TP()),
'e', from_unixtime(default_now(query_end)+3600,TP())
),
printf('https://api.sparkpost.com/api/v1/message-events?from=%s&to=%s',
from_unixtime(default_now(query_end), TP()),
from_unixtime(default_now(query_end)+3600,TP())),
5000,
map('Authorization','spark-post-key',
'User-Agent','java-sparkpost/0.19 (eccac9f)',
'Content-Type','application/json')) from raw_events;
insert overwrite table raw_events select ctx.t as load_time, ctx.s as query_begin, ctx.e as query_end, content from temp_raw_events;
insert into table history_raw_events select ctx.t as load_time, ctx.s as query_begin, ctx.e as query_end, content from temp_raw_events;
drop table if exists temp_sp_events;
CREATE temporary TABLE temp_sp_events (
ev string
) stored as orc;
--ETL extract, transform, load
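-- Split the JSON array in $.results into one row per event object
-- (the lookaround regex splits on commas between "}" and "{").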
insert overwrite table temp_sp_events select evs.ev from temp_raw_events raw
lateral view explode(
split(
su_strip(get_json_object(raw.content, '$.results'), '[]'),
'(?<=\\}),(?=\\{)')) evs as ev;
--More specific ETL job....
insert into table sp_events (ev, time_stamp, type)
select ev,
cast(unix_timestamp(substr(get_json_object(ev, '$.timestamp'), 0, 19), "yyyy-MM-dd'T'HH:mm:ss") * 1000 as timestamp) as time_stamp,
get_json_object(ev, '$.type') as type
from temp_sp_events;