If you are new to Trino, this tutorial covers a rather in-depth topic. We recommend first reading the Intro to Trino for the Trinewbie blog post to get familiar with Trino, and then the Project Tardigrade launch blog post to learn more about the features that have been added.
In this tutorial, you will:
- Learn how to set up a Trino cluster with sufficient size to test the new Project Tardigrade features.
- Learn how to quickly provision Amazon EKS and S3 resources using helm charts and eksctl.
- Run some basic queries from the TPC-H and TPC-DS benchmarks.
This tutorial requires the AWS CLI, eksctl, kubectl, and helm. To verify that they are installed, and to check their versions, run the following commands.
aws --version
eksctl version
kubectl version
helm version
Trino Tardigrade depends on a buffer layer that is currently implemented on top of an S3-compatible storage service. In this example, we'll use AWS S3, so you need an IAM user that can configure the S3 buckets used to store intermediate query results. The same user is also used to deploy Trino on EKS. Follow the AWS documentation to create an IAM user. This user must have full access to the S3 buckets specified for spooling and for permanent storage, as well as access to Glue. For the EKS setup, this user must also have the IAM policies required by eksctl. In total, the following policies are required:
- AmazonEC2FullAccess
- AWSCloudFormationFullAccess
- IAMFullAccess
- AmazonS3FullAccess
- AWSGlueConsoleFullAccess
- EKSAllAccess (note: this is not an AWS managed policy; create it yourself using the JSON below, replacing <account_id> with your AWS account ID)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "eks:*",
      "Resource": "*"
    },
    {
      "Action": [
        "ssm:GetParameter",
        "ssm:GetParameters"
      ],
      "Resource": [
        "arn:aws:ssm:*:<account_id>:parameter/aws/*",
        "arn:aws:ssm:*::parameter/aws/*"
      ],
      "Effect": "Allow"
    },
    {
      "Action": [
        "kms:CreateGrant",
        "kms:DescribeKey"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": [
        "logs:PutRetentionPolicy"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
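If you prefer the CLI over the console, you could create and attach this policy with commands along these lines (the eks-all-access.json file name and <iam-user-name> are placeholders for this example):
aws iam create-policy --policy-name EKSAllAccess --policy-document file://eks-all-access.json
aws iam attach-user-policy --user-name <iam-user-name> --policy-arn "arn:aws:iam::<account_id>:policy/EKSAllAccess"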
Once you have created the IAM user, the IAM service generates an access key ID to identify the user and a secret access key that acts as a password. The secret key is only available at creation time, so make sure to save it. For example, an access key ID looks like AKIAIOSFODNN7EXAMPLE and a secret access key looks like wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY. With these keys and the region you are running in, run the following configuration command in the AWS CLI.
$ aws configure
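The command prompts for the keys, region, and output format interactively; with the example values above, the session looks like this:
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json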
For more information, read the AWS documentation about configuring credentials for the CLI.
To create your spool and data buckets in S3, you need bucket names that are unique across all of AWS within your partition. The easiest way to do this is to append a UUID to the prefixes tardigrade-spool- and tardigrade-data-. For example, the following commands create the buckets using a UUID generated in Python.
TARDIGRADE_UUID=$(python -c 'import uuid; print(uuid.uuid4())')
aws s3 mb "s3://tardigrade-spool-${TARDIGRADE_UUID}"
aws s3 mb "s3://tardigrade-data-${TARDIGRADE_UUID}"
Now configure a bucket lifecycle policy to expire abandoned objects in the spool bucket:
aws s3api put-bucket-lifecycle-configuration --bucket "tardigrade-spool-${TARDIGRADE_UUID}" --cli-input-json file://spool-lifecycle-rules.json
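The contents of spool-lifecycle-rules.json are not shown in this post; a minimal sketch that expires leftover spool objects after one day (the rule ID and the one-day window are assumptions you can adjust) could look like this:
{
  "LifecycleConfiguration": {
    "Rules": [
      {
        "ID": "expire-abandoned-spool-objects",
        "Status": "Enabled",
        "Filter": { "Prefix": "" },
        "Expiration": { "Days": 1 },
        "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 1 }
      }
    ]
  }
}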
Create a values.yaml file by copying values-template.yaml to a new file named values.yaml and replacing the placeholders in your values.yaml:
- <trino-version> - latest release of Trino, e.g. 379
- <number-of-workers> - desired number of workers, e.g. 5
- <spool-bucket-name> - name of a bucket to be used for spooling, e.g. s3://tardigrade-spool-${TARDIGRADE_UUID}. Multiple buckets can be specified using commas, e.g. s3://bucket-1,s3://bucket-2
- <aws-region> - AWS region where the cluster is being deployed, e.g. us-east-1
- <permanent-data-bucket> - name of a bucket to be used for storing tables created by Trino, e.g. s3://tardigrade-data-${TARDIGRADE_UUID}
- <access-key>, <secret-key> - AWS access credentials
With the AWS account and buckets created and the placeholders in values.yaml filled in, it's time to create the EKS cluster. The Tardigrade team's launch blog recommends at least 1 coordinator node and 5 worker nodes, each of size m5.8xlarge. Use the following command to create the cluster.
eksctl create cluster \
--name tardigrade-cluster \
--region us-east-1 \
--node-type m5.8xlarge \
--nodes 6
Note: the node count of 6 accounts for the 1 coordinator plus the 5 workers specified in values.yaml.
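eksctl also updates your local kubeconfig, so once the cluster is up you can verify that all six nodes have registered:
kubectl get nodes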
If you haven't yet, add the Trino helm repository with the following command:
helm repo add trino https://trinodb.github.io/charts/
Now install Trino on the EKS cluster using Helm.
helm upgrade --install \
--values values.yaml \
tardigrade-kubernetes-cluster \
trino/trino \
--version 0.7.0
Now do a quick rollout restart to make sure all pods pick up the correct settings.
kubectl rollout restart deployment \
tardigrade-kubernetes-cluster-trino-coordinator \
tardigrade-kubernetes-cluster-trino-worker
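To confirm that the coordinator and worker pods come back up after the restart, check their status; the label selector below mirrors the one used for port forwarding in the next step:
kubectl get pods -l "app=trino,release=tardigrade-kubernetes-cluster"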
First, you'll need to forward the coordinator's HTTP port to your local machine by running the following:
kubectl port-forward \
$(kubectl get pods --namespace default -l "app=trino,release=tardigrade-kubernetes-cluster,component=coordinator" -o jsonpath="{.items[0].metadata.name}") \
8080:8080
Download the Trino CLI, or any other client that supports the Trino JDBC driver, and point it to the local server at 127.0.0.1:8080 (this is the default URL in the CLI).
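For example, with the Trino CLI (assuming the executable was saved as trino in the current directory):
./trino --server http://127.0.0.1:8080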
Congratulations! You have now set up your first Tardigrade cluster! Now it's time to test the features! We're going to borrow queries from the TPC-H and TPC-DS benchmarks to run a long-running query, then remove a node to cause some of its tasks to fail.
CREATE SCHEMA hive.sf10
WITH ( location = 's3://tardigrade-data-4d0faa5d-5f92-4b6c-8c18-fef6d359a28b/sf10/');
CREATE TABLE hive.sf10.q1
AS
WITH
customer_total_return AS (
SELECT
"sr_customer_sk" "ctr_customer_sk"
, "sr_store_sk" "ctr_store_sk"
, "sum"("sr_return_amt") "ctr_total_return"
FROM
store_returns
, date_dim
WHERE ("sr_returned_date_sk" = "d_date_sk")
GROUP BY "sr_customer_sk", "sr_store_sk"
)
SELECT *
FROM
customer_total_return ctr1
, store
, customer
WHERE ("ctr1"."ctr_total_return" <> (
SELECT ("avg"("ctr_total_return") * DECIMAL '1.2')
FROM
customer_total_return ctr2
WHERE ("ctr1"."ctr_store_sk" = "ctr2"."ctr_store_sk")
))
AND ("s_store_sk" = "ctr1"."ctr_store_sk")
AND ("ctr1"."ctr_customer_sk" = "c_customer_sk")
While this is running, run the following command to scale down the number of worker nodes, which manually kills one of the workers and causes its tasks to fail.
kubectl scale deployment tardigrade-kubernetes-cluster-trino-worker --replicas=4
To remove the S3 buckets and all the data in them, run the following commands:
aws s3 rb "s3://tardigrade-spool-${TARDIGRADE_UUID}" --force
aws s3 rb "s3://tardigrade-data-${TARDIGRADE_UUID}" --force
To tear down the Tardigrade EKS cluster, run the following command:
eksctl delete cluster \
--name tardigrade-cluster \
--region us-east-1
- Queries may start failing with software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. If your workload is I/O intensive, it is possible that S3 starts throttling some requests. The request limits are enforced per bucket, so it is recommended to add another bucket for spooling to allow Trino to balance the load:
exchange.base-directories=s3://<first-bucket>,s3://<second-bucket>
- Queries may fail with Task descriptor storage capacity has been exceeded. Trino has to maintain descriptors for each task in case it has to be restarted in the event of a failure. This information is currently stored in memory on the coordinator. We are planning to implement adaptive spilling for task descriptors, but at the moment there is a chance that queries may hit this limit. If this happens, we recommend running fewer queries concurrently or using a coordinator with more memory.