Imports Daft.ie listing data into a BigQuery table in an idempotent fashion.
- Daft2BigQuery
- Setting up a bigquery table
- Deploying the cloud function
- Sending your first payload
- Setting up a scheduled run
- Data Studio
I set up the BigQuery table through the UI, using the schema in deploy/bigquery_example_schema.json. There's probably a much neater way to do it with the gcloud CLI, but for a once-off it isn't so bad.
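If you'd rather script it, something along these lines should work with the bq command-line tool (the dataset and table names are taken from the payload and query examples later in this README, so treat this as a sketch rather than the exact setup I used):

```sh
# Create the dataset and the table from the example schema.
bq mk --dataset daft_housing_data
bq mk --table daft_housing_data.sales_data deploy/bigquery_example_schema.json
```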
The Makefile has a publish target that creates a Pub/Sub topic and deploys the code as a Cloud Function. Specify the project ID by prepending the PROJECT_ID variable to the command.
$ PROJECT_ID=<redacted> make publish
poetry export -f requirements.txt --output requirements.txt --without-hashes
gcloud functions deploy daft2bigquery --region europe-west1 --project <redacted> --runtime python38 --memory 256MB --entry-point execute_daft2bigquery --trigger-topic "trigger-daft2bigquery" --timeout 540s --max-instances 1
For Cloud Build Stackdriver Logs, visit: https://console.cloud.google.com/logs/viewer?<redacted>
Deploying function (may take a while - up to 2 minutes)...done.
availableMemoryMb: 256
buildId: <redacted>
entryPoint: execute_daft2bigquery
eventTrigger:
  eventType: google.pubsub.topic.publish
  failurePolicy: {}
  resource: projects/<redacted>/topics/trigger-daft2bigquery
  service: pubsub.googleapis.com
ingressSettings: ALLOW_ALL
labels:
  deployment-tool: cli-gcloud
maxInstances: 1
name: projects/<redacted>/locations/europe-west1/functions/daft2bigquery
runtime: python38
serviceAccountEmail: <redacted>@appspot.gserviceaccount.com
status: ACTIVE
timeout: 540s
updateTime: '<redacted>'
versionId: '<redacted>'
Published
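The output above only shows the deploy step; if the trigger topic doesn't already exist in the project, it can also be created by hand (the publish target is expected to take care of this):

```sh
# Create the Pub/Sub topic that triggers the function.
gcloud pubsub topics create trigger-daft2bigquery --project "$PROJECT_ID"
```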
After deployment, you can open the function in the console and go to the Testing tab. There's an example payload in deploy/pubsub_payload.json and example_pubsub_payload.json. The Makefile also has an add_job target that publishes the contents of deploy/pubsub_payload.json to the Pub/Sub topic, which triggers the Cloud Function.
$ PROJECT_ID=<redacted> make add_job
gcloud pubsub topics publish "projects/<redacted>/topics/trigger-daft2bigquery" --message='{ "bq_table_id": "daft_housing_data.sales_data", "locations": ["ireland"], "search_type": "property-for-sale", "max_pages": 25, "page_offset": 26 }'
messageIds:
- '2027916082568790'
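Doing the same thing without the Makefile looks roughly like this, publishing the payload file from the repo directly:

```sh
# Publish the example payload straight from the file to trigger a run.
gcloud pubsub topics publish "projects/$PROJECT_ID/topics/trigger-daft2bigquery" \
  --message="$(cat deploy/pubsub_payload.json)"
```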
Cloud Scheduler is a neat way to trigger the Cloud Function on a schedule. Use the payload example from above as the message body, as in the sketch below.
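A scheduler job that publishes that payload on a cron schedule might look something like this (the job name, schedule and time zone are just example values):

```sh
# Publish the daft2bigquery payload every day at 06:00 Irish time.
gcloud scheduler jobs create pubsub daft2bigquery-daily \
  --project "$PROJECT_ID" \
  --location europe-west1 \
  --schedule "0 6 * * *" \
  --time-zone "Europe/Dublin" \
  --topic trigger-daft2bigquery \
  --message-body "$(cat deploy/pubsub_payload.json)"
```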
The main downside to my approach is that it produces a large number of duplicate entries, which have to be deduplicated before graphing. I've used the following custom query in the Data Studio connection to take the most recent entry for each listing:
SELECT *
FROM daft_housing_data.sales_data
WHERE entryDate IN (
  SELECT MAX(entryDate)
  FROM daft_housing_data.sales_data
  GROUP BY id
)
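To sanity-check the query from the command line rather than from Data Studio, something like this should work (assuming bq is pointed at the right project):

```sh
# Run the dedup query with standard SQL from the CLI.
bq query --use_legacy_sql=false \
  'SELECT *
   FROM daft_housing_data.sales_data
   WHERE entryDate IN (
     SELECT MAX(entryDate)
     FROM daft_housing_data.sales_data
     GROUP BY id
   )'
```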