[WIP] GitHub issue summarization workflow.

Prerequisites.

Get the input data and upload it to GCS.

Download the input data from this location. In the following, we assume that the downloaded file is ./github-issues.zip
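
As a minimal sketch, assuming the archive is served over HTTP(S) at some URL (the <DATA_URL> placeholder below is hypothetical), it can be fetched with curl:

curl -L -o ./github-issues.zip <DATA_URL>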

Decompress the input data:

unzip ./github-issues.zip

For debugging purposes, consider reducing the size of the input data so that the workflow executes much faster:

head -n 10000 ./github-issues.csv > ./github-issues-medium.csv

Compress the data using gzip (the workflow expects gzip-compressed input):

gzip ./github-issues-medium.csv
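
Optionally, the archive can be sanity-checked before uploading; gzip -t tests integrity without decompressing:

gzip -t ./github-issues-medium.csv.gz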

Upload the data to GCS:

gsutil cp ./github-issues-medium.csv.gz gs://<MY_BUCKET>
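
If the bucket does not exist yet, it can be created first, and the upload verified by listing it (standard gsutil commands):

gsutil mb gs://<MY_BUCKET>
gsutil ls gs://<MY_BUCKET>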

Building the container.

Build the container and tag it so that it can be pushed to a GCP container registry:

docker build -f Dockerfile -t gcr.io/<GCP_PROJECT>/github_issue_summarization:v1 .

Push the container to the GCP container registry:

gcloud docker -- push gcr.io/<GCP_PROJECT>/github_issue_summarization:v1
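
Note that the gcloud docker wrapper is deprecated in recent gcloud releases; as an alternative sketch, register Docker credentials for gcr.io once and push directly:

gcloud auth configure-docker
docker push gcr.io/<GCP_PROJECT>/github_issue_summarization:v1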

Running the workflow.

Run the workflow:

argo submit github_issues_summarization.yaml \
  -p bucket=<BUCKET_NAME> \
  -p bucket-key=<BUCKET_KEY> \
  -p container-image=gcr.io/<GCP_PROJECT>/github_issue_summarization:v1

Where:

  • <BUCKET_NAME> is the name of a GCS bucket where the input data is stored (e.g.: "my_bucket_1234").
  • <BUCKET_KEY> is the path to the input data in csv.gz format (e.g.: "data/github_issues.csv.gz").
  • <GCP_PROJECT> is the name of the GCP project where the container was pushed.
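
For illustration, here is the same command with example values substituted (all values are placeholders, including the my-gcp-project project name):

argo submit github_issues_summarization.yaml \
  -p bucket=my_bucket_1234 \
  -p bucket-key=data/github_issues.csv.gz \
  -p container-image=gcr.io/my-gcp-project/github_issue_summarization:v1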

The data generated by the workflow will be stored in the default artifact repository configured for your Argo installation.

The logs can be read using the argo get and argo logs commands.
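
As a usage sketch (argo submit prints the generated workflow name; substitute it for <WORKFLOW_NAME> below, and note that depending on the Argo version, argo logs takes a workflow name or a pod name):

argo get <WORKFLOW_NAME>
argo logs <WORKFLOW_NAME>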