Using docker to launch the hive server:

```bash
docker-compose -f ./docker/docker-compose.yaml up -d --force-recreate --remove-orphans
```
If the query result is large, you may need to increase the memory of the hive server (e.g., set the maximum heap size to 8G):

```bash
docker-compose -f ./docker/docker-compose.yaml up -d -e SERVICE_OPTS="-Xmx8G" --force-recreate --remove-orphans
```
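Note that `docker-compose up` itself does not accept an `-e` flag (that flag belongs to `docker-compose run`/`exec`); if your compose version rejects the command above, an alternative is to set the variable in the compose file instead. A hypothetical excerpt (the service name and variable wiring must match what `./docker/docker-compose.yaml` actually defines):

```yaml
# hypothetical excerpt of ./docker/docker-compose.yaml -- adjust the service
# name to match the real file
services:
  hive:
    environment:
      - SERVICE_OPTS=-Xmx8G   # raise the hive server's max heap to 8G
```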
Connecting to the hive server:

```bash
docker exec -it hive beeline -u 'jdbc:hive2://localhost:10000/;transportMode=http;httpPath=cliservice'
```
Refer to apache/hive for detailed documentation.
In this repo, we set `hive.default.fileformat` to `Vineyard` and `hive.metastore.warehouse.dir` to `vineyard:///user/hive/warehouse` (in `java/hive/conf/hive-site.xml`), so the default storage format of hive is vineyard.
If you want to use the local file system or HDFS instead, you need to change this configuration or specify the storage format explicitly when creating a table.
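Based on the two properties named above, the relevant part of `java/hive/conf/hive-site.xml` looks roughly like this (a sketch; check the file in the repo for the authoritative contents):

```xml
<property>
  <name>hive.default.fileformat</name>
  <value>Vineyard</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>vineyard:///user/hive/warehouse</value>
</property>
```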
Create a table stored as TEXTFILE and insert some data:

```sql
show tables;
create table hive_example(
    a string,
    b int)
stored as TEXTFILE
location "file:///opt/hive/data/warehouse/hive_example";
insert into hive_example values('a', 1), ('a', 2), ('b', 3);
select count(distinct a) from hive_example;
select sum(b) from hive_example;
```
Inspect the storage location and format of the hive table:

```sql
describe formatted hive_example;
```
Start vineyard server:

The socket file must be placed in the correct directory; refer to the docker-compose.yaml file for details. You can change the socket file path as you like, as long as you update docker-compose.yaml accordingly.

```bash
vineyardd --socket=./vineyard/vineyard.sock --meta=local
```
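If you relocate the socket, the bind mount in docker-compose.yaml has to expose the same directory inside the container. A hypothetical excerpt (the service name and container-side path are illustrative and must match your actual compose file):

```yaml
# hypothetical excerpt -- adjust service name and container path to match
# your docker-compose.yaml
services:
  hive:
    volumes:
      - ./vineyard:/tmp/vineyard   # host dir containing vineyard.sock
```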
Create a hive table on vineyard:

```sql
create table hive_example(
    a string,
    b int);
describe formatted hive_example;
drop table hive_example;
```
Create a table and select from it:

```sql
create table hive_example2(
    field_1 string,
    field_2 int);
insert into hive_example2 values('a', 1), ('b', 2), ('c', 3);
select * from hive_example2;
explain vectorization only select * from hive_example2;
drop table hive_example2;
```
Vectorized input and output (currently unavailable):

```sql
set hive.fetch.task.conversion=none;
set hive.vectorized.use.vectorized.input.format=true;
set hive.vectorized.use.row.serde.deserialize=false;
set hive.vectorized.use.vector.serde.deserialize=true;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.vectorized.row.serde.inputformat.excludes=io.v6d.hive.ql.io.VineyardInputFormat;
create table hive_example(
    field_1 int,
    field_2 bigint,
    field_3 boolean,
    field_4 string,
    field_5 double,
    field_6 float)
row format serde "io.v6d.hive.ql.io.VineyardVectorizedSerDe"
stored as
    INPUTFORMAT 'io.v6d.hive.ql.io.VineyardVectorizedInputFormat'
    OUTPUTFORMAT 'io.v6d.hive.ql.io.VineyardOutputFormat'
LOCATION "vineyard:///opt/hive/data/warehouse/hive_example";
insert into hive_example values(1, 1, true, 'a', 1.0, 1.0), (2, 2, true, 'b', 2.0, 2.0), (3, 3, false, 'c', 3.0, 3.0);
select * from hive_example;
explain vectorization select * from hive_example;
drop table hive_example;
```
Test large data sets:

The following SQL statements read the livejournal dataset (a CSV file with about 27 million lines) and store it in vineyard. The dataset must be placed in the correct directory.
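If the full livejournal dataset is not at hand, a small stand-in CSV with the same two-column shape can be generated for a smoke test. The path below is illustrative (the walkthrough loads from `/opt/hive/data/warehouse/`; adjust accordingly):

```shell
# create a tiny two-column edge list as a stand-in for soc-livejournal.csv
mkdir -p /tmp/hive-data/warehouse
printf '%s\n' '0,1' '0,2' '1,2' '2,0' > /tmp/hive-data/warehouse/soc-livejournal.csv
# sanity check: the generated file contains 4 lines
wc -l < /tmp/hive-data/warehouse/soc-livejournal.csv
```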
```sql
create table hive_example3(
    src_id int,
    dst_id int);
create table hive_test_data_livejournal(
    src_id int,
    dst_id int)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
stored as textfile;
load data local inpath "file:///opt/hive/data/warehouse/soc-livejournal.csv" into table hive_test_data_livejournal;
insert into hive_example3 select * from hive_test_data_livejournal;
drop table hive_test_data_livejournal;
select * from hive_example3;
```
Test static partition:

```sql
create table hive_static_partition(
    src_id int,
    dst_id int)
partitioned by (value int);
insert into table hive_static_partition partition(value=666) values (3, 4);
insert into table hive_static_partition partition(value=666) values (999, 2), (999, 2), (999, 2);
insert into table hive_static_partition partition(value=114514) values (1, 2);
select * from hive_static_partition;
select * from hive_static_partition where value=666;
select * from hive_static_partition where value=114514;
drop table hive_static_partition;
```
Test dynamic partition:

```sql
create table hive_dynamic_partition_data(
    src_id int,
    dst_id int,
    year int)
stored as TEXTFILE
location "file:///opt/hive/data/warehouse/hive_dynamic_partition_data";
insert into table hive_dynamic_partition_data values (1, 2, 2018), (3, 4, 2018), (1, 2, 2017);
create table hive_dynamic_partition_test(
    src_id int,
    dst_id int)
partitioned by (month int, year int);
insert into table hive_dynamic_partition_test partition(month=1, year) select src_id, dst_id, year from hive_dynamic_partition_data;
select * from hive_dynamic_partition_test;
drop table hive_dynamic_partition_test;
drop table hive_dynamic_partition_data;
```
Test all primitive types:

Vineyard now supports storing tinyint, smallint, int, bigint, boolean, string, float, double, date, timestamp, binary and decimal.

```sql
create table test_all_primitive_types (
    field_1 tinyint,
    field_2 smallint,
    field_3 bigint,
    field_4 int,
    field_5 double,
    field_6 float,
    field_7 string,
    field_9 varchar(10),
    field_10 char(10),
    field_8 binary,
    field_11 date,
    field_12 boolean,
    field_13 timestamp,
    field_14 decimal(6, 2));
insert into test_all_primitive_types select
    tinyint(1), smallint(1), 42, bigint(1), double(2.0), float(1.0),
    'hello world1!', 'hello world2!', 'hello world3!', cast('hello world4!' as binary),
    date('2023-12-31'), true, timestamp('2023-12-31 23:59:59'), cast(1234.56 as decimal);
select * from test_all_primitive_types;
drop table test_all_primitive_types;
```
Test nested types:

Vineyard now supports storing array, map and struct.

```sql
create table nested_table (
    field_1 map<int, array<struct<field_1:int, field_2:string>>>);
insert into nested_table select map(
    42,
    array(named_struct('field_1', 1, 'field_2', 'hello'),
          named_struct('field_1', 2, 'field_2', 'world!')));
select * from nested_table;
drop table nested_table;
```
Download hive 3.1.3 and unpack it somewhere:

```bash
wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
```
Configure the Spark session:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
conf.setAppName("Spark on Vineyard")
  // use local executor for development & testing
  .setMaster("local[*]")
  // ensure all executors are ready
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")

val spark = SparkSession
  .builder()
  .config(conf)
  .config("hive.metastore.uris", "thrift://localhost:9083")
  .config("hive.metastore.sasl.enabled", "false")
  .config("hive.server2.authentication", "NOSASL")
  .config("hive.metastore.execute.setugi", "false")
  .config("spark.sql.hive.metastore.version", "3.1.3")
  .config("spark.sql.hive.metastore.jars", "path")
  .config("spark.sql.hive.metastore.jars.path", "/opt/apache-hive-3.1.3-bin/lib/*")
  .enableHiveSupport()
  .getOrCreate()

val sc: SparkContext = spark.sparkContext
```
Use the session:

```scala
spark.sql(".....")

sc.stop()
```
Refer to Spark/Hive for detailed documentation.
### Prepare vineyard jars

```bash
# Currently, the vineyard jar cannot run directly on hive because of
# dependency conflicts. You can run it temporarily by reverting to an
# older version of guava (such as 14.0.1) that vineyard depends on.
# This problem will be fixed in the future.
cd v6d/java
mvn clean package
```
### Build docker images

```bash
cd v6d/java/hive/docker
./build.sh
docker network create hadoop-network
```
### Start sql server for hive metastore

```bash
cd v6d/java/hive/docker/dependency/mysql
docker-compose -f mysql-compose.yaml up -d
# You can change the password in mysql-compose.yaml and hive-site.xml
```
### Start vineyardd servers

```bash
cd v6d/build

# at terminal 1
./bin/vineyardd --socket=~/vineyard_sock/0/vineyard.sock -rpc_socket_port=9601 --etcd_endpoint="0.0.0.0:2382"

# at terminal 2
./bin/vineyardd --socket=~/vineyard_sock/1/vineyard.sock -rpc_socket_port=9602 --etcd_endpoint="0.0.0.0:2382"

# at terminal 3
./bin/vineyardd --socket=~/vineyard_sock/2/vineyard.sock -rpc_socket_port=9603 --etcd_endpoint="0.0.0.0:2382"

# at terminal 4
./bin/vineyardd --socket=~/vineyard_sock/metastore/vineyard.sock -rpc_socket_port=9604 --etcd_endpoint="0.0.0.0:2382"

# at terminal 5
./bin/vineyardd --socket=~/vineyard_sock/hiveserver/vineyard.sock -rpc_socket_port=9605 --etcd_endpoint="0.0.0.0:2382"
```
### Copy vineyard jars to share dir

```bash
mkdir -p v6d/share
cd v6d/java/hive
# you can change the share dir in docker-compose.yaml
cp target/vineyard-hive-0.1-SNAPSHOT.jar ../../share
```
### Run hadoop & hive docker images

```bash
cd v6d/java/hive/docker
docker-compose -f docker-compose-distributed.yaml up -d
```

Connect to the hive server:

```bash
docker exec -it hive-hiveserver2 beeline -u "jdbc:hive2://hive-hiveserver2:10000" -n root
```