popdensity

Accompanies the blog post "Querying geographic raster data in BigQuery the brute force way"

https://medium.com/@lakshmanok/querying-geographic-raster-data-in-bigquery-the-brute-force-way-1da46799d65f (this post stores it pixel-by-pixel: use convert_to_geo.py to replicate)

and

https://medium.com/google-cloud/how-to-query-geographic-raster-data-in-bigquery-efficiently-b178b1a5e723 (this post stores it as rectangles, and is much more efficient; use convert_to_rect.py to replicate)

Decide whether you want to load the NASA data or the SEDAC data. The NASA file is smaller (10km resolution) and is what I recommend if you just want to try things out. The SEDAC file is 1km resolution and much more accurate, so do that if you want to use this to make real decisions. Both datasets are global.

Option 1: NASA (low resolution)

Download a 3600x1800 resolution CSV file from https://neo.sci.gsfc.nasa.gov/view.php?datasetId=SEDAC_POP

Assuming you saved it as popdensity.csv, run:

  ./convert_to_rect.py --format=nasa --input popdensity.csv

Run the following script specifying a tablename in BigQuery (the dataset is hardcoded to advdata) and a staging bucket in GCS:
```
 ./load_geo_into_bq.sh  popdensity_nasa   staging-bucket
```

Try this query in BigQuery to find the most populated urban areas in Washington state:

WITH urban_populations AS (
    SELECT 
       lsad_name
       , SUM(ST_AREA(ST_INTERSECTION(bounds, urban_area_geom))/1000000) AS area_sqkm
       , COUNT(1) AS num_rectangles
       , AVG(population_density) AS pop_density
    FROM advdata.popdensity_nasa, `bigquery-public-data.geo_us_boundaries.urban_areas`
    WHERE ST_INTERSECTS(bounds, urban_area_geom) 
          AND STRPOS(lsad_name, ', WA') > 0
    GROUP BY lsad_name
)
SELECT 
       *, (area_sqkm * pop_density / 1000000) AS population_millions
FROM urban_populations
ORDER BY area_sqkm DESC
LIMIT 10

Option 2: SEDAC (high resolution)

Download 30-arcsec resolution ASCII file from https://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-density-rev11/data-download
Unzip the downloaded file. You will have 8 tiles with the extension .asc.
To load in the global data, run (rect is preferred):
```
   ./convert_to_[geo/rect].py --format=sedac --input gpw_*.asc
```
If you want data for just Australia, you could load up only the tile that contains Australia:
```
   ./convert_to_rect.py --format=sedac --input gpw_*_8.asc
```
If you want data for just the United States, you need only tiles 1 and 2, so you could do:
```
   ./convert_to_rect.py --format=sedac --input gpw_*_1.asc gpw_*_2.asc
```
Run the following script specifying a tablename in BigQuery (the dataset is hardcoded to advdata) and a staging bucket in GCS:
```
 ./load_geo_into_bq.sh  popdensity_sedac_[pixels/rectangles]   staging-bucket
```
Name the table popdensity_sedac_pixels if you ran convert_to_geo.py and popdensity_sedac_rectangles if you ran convert_to_rect.py

If you already have popdensity_sedac_rectangles, skip this step. If you ran convert_to_geo, then you got popdensity_sedac_pixels and you should run the following query in BigQuery to create rectangles (i.e. run-length encode the pixels within a tile that have the same population density):

CREATE OR REPLACE TABLE advdata.popdensity_sedac_rectangles AS
PARTITION BY RANGE_BUCKET(year, GENERATE_ARRAY(2000, 2100, 5))
CLUSTER BY tile
OPTIONS(require_partition_filter=True)

with dissolved AS (
       SELECT 
         year, tile, rowno, population_density,
         ST_UNION( ARRAY_AGG(bounds) ) AS bounds,
       FROM advdata.sedac_pixels
       GROUP BY year, tile, rowno, population_density
)

SELECT *, ST_CENTROID(bounds) AS location
FROM dissolved

It will take around 20 minutes.

Try this query in BigQuery to find the most populated urban areas in Washington state:

WITH urban_populations AS (
    SELECT 
       lsad_name
       , SUM(ST_AREA(ST_INTERSECTION(bounds, urban_area_geom))/1000000) AS area_sqkm
       , COUNT(1) AS num_rectangles
       , AVG(population_density) AS pop_density
    FROM advdata.popdensity_sedac_rectangles, `bigquery-public-data.geo_us_boundaries.urban_areas`
    WHERE tile = 'gpw_v4_population_density_rev11_2020_30_sec_1.asc'
          AND ST_INTERSECTS(bounds, urban_area_geom) 
          AND STRPOS(lsad_name, ', WA') > 0
    GROUP BY lsad_name
)
SELECT 
       *, (area_sqkm * pop_density / 1000000) AS population_millions
FROM urban_populations
ORDER BY area_sqkm DESC
LIMIT 10

Name		Name	Last commit message	Last commit date
parent directory ..
.gitignore		.gitignore
README.md		README.md
convert_to_geo.py		convert_to_geo.py
convert_to_rect.py		convert_to_rect.py
load_geo_into_bq.sh		load_geo_into_bq.sh
schema.json		schema.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

popdensity

popdensity

README.md

Option 1: NASA (low resolution)

Option 2: SEDAC (high resolution)

Files

popdensity

Directory actions

More options

Directory actions

More options

Latest commit

History

popdensity

Folders and files

parent directory

README.md

Option 1: NASA (low resolution)

Option 2: SEDAC (high resolution)