#!/bin/bash
# Define the bucket destinations
test_bucket="gs://msm-test-manual-data-bucket"
prod_bucket="gs://prod-manual-data-bucket"
# Define the name of the repository directory
cdc_repo_name="covid_case_restricted_detailed"
# Find the parent directory of the health-equity-tracker directory
parent_dir=$(dirname "$(dirname "$(dirname "$(realpath "$0")")")")
# Define the path to the covid_case_restricted_detailed repository
repo_path="$parent_dir/$cdc_repo_name"
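# Illustrative sketch only, not part of the original script: a hypothetical
# helper (the name require_cmds is an assumption) that reports which of the
# external tools used further down (git, unzip, gsutil, python) are missing
# from PATH. It is defined here but deliberately not called, so the script's
# behavior is unchanged.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "Missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
# Possible usage: require_cmds git unzip gsutil python || exit 1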
# Clone or pull the repository
if [ ! -d "$repo_path" ]; then
  echo "Repository directory not found. Cloning the repository..."
  git clone https://github.com/cdc-data/covid_case_restricted_detailed.git "$repo_path" || {
    echo "Failed to clone repository"
    exit 1
  }
else
  echo "Repository directory found. Pulling latest changes..."
  cd "$repo_path" || {
    echo "Failed to navigate to repository directory"
    exit 1
  }
  git pull origin master || {
    echo "Failed to pull latest changes"
    exit 1
  }
fi
# Define the path to the data directory
data_dir="$repo_path/data"
# Check if the data directory exists
if [ ! -d "$data_dir" ]; then
  echo "Data directory not found"
  exit 1
fi
# Find the most recently modified directory within the data_dir (ls -t sorts by modification time)
# shellcheck disable=SC2012
most_recent_dir=$(ls -td "$data_dir"/*/ | head -n 1)
if [ -z "$most_recent_dir" ]; then
  echo "No directories found in ${data_dir}"
  exit 1
fi
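# Illustrative alternative, not part of the original script: the ls-based
# lookup above needs a ShellCheck exemption because it parses ls output. A
# hypothetical helper (newest_subdir is an assumed name) could instead pick the
# most recently modified subdirectory with a glob and the -nt file test.
# It is defined here but not called.
newest_subdir() {
  local parent=$1 newest="" candidate
  for candidate in "$parent"/*/; do
    # Skip the unexpanded pattern when the glob matches nothing
    [ -d "$candidate" ] || continue
    if [ -z "$newest" ] || [ "$candidate" -nt "$newest" ]; then
      newest=$candidate
    fi
  done
  # Print the result, or return non-zero when no subdirectory exists
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
# Possible usage: most_recent_dir=$(newest_subdir "$data_dir")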
# Navigate to the most recently modified directory within the data_dir
cd "$most_recent_dir" || {
  echo "Failed to navigate to most recent directory"
  exit 1
}
# Unzip all zip files in the most recent directory
unzip -n '*.zip' || {
  echo "Failed to unzip files"
  exit 1
}
# Navigate to the health-equity-tracker directory
cd ../../../health-equity-tracker || {
  echo "Failed to navigate to health-equity-tracker directory"
  exit 1
}
# shellcheck disable=SC1091
source .venv/bin/activate || {
  echo "Failed to activate virtual environment"
  exit 1
}
# Run the local module to generate non-restricted files
python python/datasources/cdc_restricted_local.py -dir "$most_recent_dir" -prefix spark_part || {
  echo "Failed to run cdc_restricted_local.py"
  exit 1
}
# Upload CSV files to the test bucket
for csv_file in "$most_recent_dir"/cdc_restricted_by_*.csv; do
  gsutil cp "$csv_file" "$test_bucket/" || {
    echo "Failed to upload $csv_file to $test_bucket"
    exit 1
  }
  echo "Uploaded $csv_file to $test_bucket"
done
# If the upload to the test bucket succeeded, upload to the prod bucket
for csv_file in "$most_recent_dir"/cdc_restricted_by_*.csv; do
  gsutil cp "$csv_file" "$prod_bucket/" || {
    echo "Failed to upload $csv_file to $prod_bucket"
    exit 1
  }
  echo "Uploaded $csv_file to $prod_bucket"
done
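# Illustrative sketch only, not part of the original script: if no
# cdc_restricted_by_*.csv files exist, the loops above hand the unexpanded
# pattern to gsutil. A hypothetical helper (count_matching_files is an assumed
# name) could count the real matches first. It is defined here but not called.
count_matching_files() {
  local count=0 path
  for path in "$@"; do
    # Only count arguments that are existing regular files
    [ -f "$path" ] && count=$((count + 1))
  done
  echo "$count"
}
# Possible usage, placed before the upload loops:
#   [ "$(count_matching_files "$most_recent_dir"/cdc_restricted_by_*.csv)" -gt 0 ] || exit 1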
echo "🙌 Done!
Next steps for you:
1. Make a new PR updating the 'original_data_sourced' fields for entries starting with 'cdc_restricted_data-' in frontend/src/data/config/DatasetMetadata.ts
   - NOTE: the most recent data is from the previous month, so if you're updating from the CDC's May release, the new data will be sourced from April.
2. Run the cdc_restricted DAG from the HET Infra TEST Composer: https://console.cloud.google.com/composer/environments?project=het-infra-test-05
   - If a Composer environment exists at that link, click Airflow, and then the play button for the cdc_restricted DAG.
   - If Composer doesn't exist, you can rebuild it by adding an empty comment to any file inside python/ in the PR you just made. Merging that PR to main will automatically rebuild Composer on the test environment, allowing Airflow usage. You can also force push your updated local main branch to 'infra-test' to trigger the rebuild.
   - Once the DAG completes successfully, you should be able to view the updated data pipeline output in the test GCP project's BigQuery tables and in the exported .json files in the GCP buckets. Once you merge the PR from Step 1, the updated data should show on the dev site: https://dev.healthequitytracker.org/exploredata?mls=1.covid-3.00&group1=All#rates-over-time
3. If all looks good on staging, cut a new release: https://www.notion.so/healthequitytracker/Cut-and-Deploy-New-Release-to-Production-18f7e04e42f444ad83a5c857d4007090?pvs=4
   - Make sure you select 'True' when asked if the release includes changes to the data pipeline and requires Composer.
4. Run the cdc_restricted DAG from the HET Infra PROD Composer: https://console.cloud.google.com/composer/environments?project=het-infra-prod-f6
5. Once the DAG completes successfully, you should be able to see the results on the real production site: https://healthequitytracker.org/exploredata?mls=1.covid-3.00&group1=All#rates-over-time
6. Don't forget to turn off both Composers once you're done using them:
   - https://github.com/SatcherInstitute/setup-cloud-platform/actions/workflows/cronDeleteComposer.yml
   - https://github.com/SatcherInstitute/health-equity-tracker/actions/workflows/cronDeleteComposer.yml
"
exit 0