- Create a GKE cluster and configure kubectl.
- Install Argo.
- Configure the default artifact repository.
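A minimal sketch of this setup, assuming a cluster named `gh-demo` in `us-central1-a`, a namespace-scoped Argo install, and Argo's `workflow-controller-configmap` for the artifact repository settings; the cluster name, zone, and install manifest URL are placeholders, not values prescribed by this example:

```
# Create a GKE cluster and configure kubectl to talk to it.
gcloud container clusters create gh-demo --zone us-central1-a
gcloud container clusters get-credentials gh-demo --zone us-central1-a

# Install Argo; substitute the install manifest for the Argo release you are using.
kubectl create namespace argo
kubectl apply -n argo -f <ARGO_INSTALL_MANIFEST_URL>

# Configure the default artifact repository (e.g. a GCS bucket) in the
# workflow controller's ConfigMap.
kubectl edit configmap workflow-controller-configmap -n argo
```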
Get the input data from this location. In the following, we assume that the file path is `./github-issues.zip`.
Decompress the input data:
```
unzip ./github-issues.zip
```
For debugging purposes, consider reducing the size of the input data; the workflow will execute much faster:
```
cat ./github-issues.csv | head -n 10000 > ./github-issues-medium.csv
```
Compress the data using gzip (this is the format assumed by the workflow):
```
gzip ./github-issues-medium.csv
```
Upload the data to GCS:
```
gsutil cp ./github-issues-medium.csv.gz gs://<MY_BUCKET>
```
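To confirm the upload landed where the workflow expects it, you can list the object (the bucket name is a placeholder):

```
gsutil ls gs://<MY_BUCKET>/github-issues-medium.csv.gz
```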
Build the container and tag it so that it can be pushed to the GCP Container Registry:
```
docker build -f Dockerfile -t gcr.io/<GCP_PROJECT>/github_issue_summarization:v1 .
```
Push the container to the GCP container registry:
```
gcloud docker -- push gcr.io/<GCP_PROJECT>/github_issue_summarization:v1
```
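If your gcloud version no longer ships the `gcloud docker` wrapper, a common alternative (not part of the original instructions) is to register gcloud as a Docker credential helper and push directly:

```
# Register gcloud as a credential helper for gcr.io, then push with plain docker.
gcloud auth configure-docker
docker push gcr.io/<GCP_PROJECT>/github_issue_summarization:v1
```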
Run the workflow:
```
argo submit github_issues_summarization.yaml \
  -p bucket=<BUCKET_NAME> \
  -p bucket-key=<PATH_TO_INPUT_DATA_IN_BUCKET> \
  -p container-image=gcr.io/<GCP_PROJECT>/github_issue_summarization:v1
```
Where:
- <BUCKET_NAME> is the name of a GCS bucket where the input data is stored (e.g.: "my_bucket_1234").
- <PATH_TO_INPUT_DATA_IN_BUCKET> is the path to the input data in csv.gz format (e.g.: "data/github_issues.csv.gz").
- <GCP_PROJECT> is the name of the GCP project where the container was pushed.
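For example, with the illustrative values above filled in (the project name `my-gcp-project` is likewise just a placeholder), the invocation might look like:

```
argo submit github_issues_summarization.yaml \
  -p bucket=my_bucket_1234 \
  -p bucket-key=data/github_issues.csv.gz \
  -p container-image=gcr.io/my-gcp-project/github_issue_summarization:v1
```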
The data generated by the workflow will be stored in the default artifact repository specified in the previous section.
The logs can be read using the `argo get` and `argo logs` commands.
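As a rough sketch of following a run (the workflow name is whatever `argo submit` printed):

```
# List submitted workflows and inspect the status of a specific run.
argo list
argo get <WORKFLOW_NAME>

# Read logs; depending on your Argo CLI version this takes the workflow name
# or a pod name reported by `argo get`.
argo logs <WORKFLOW_NAME>
```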