gdal_config option doesn't work for threaded readers #186
Description
We already tried to solve this using apiroute_factory
https://developmentseed.org/titiler/concepts/APIRoute_and_environment_variables/
from fastapi import FastAPI, APIRouter
from rasterio._env import get_gdal_config
from titiler.custom.routing import apiroute_factory
from titiler.endpoints.factory import TilerFactory
app = FastAPI()
route_class = apiroute_factory({"GDAL_DISABLE_READDIR_ON_OPEN": "FALSE"})
router = APIRouter(route_class=route_class)
tiler = TilerFactory(router=router)
While this approach worked for a while, it didn't seems to work in python 3.8 and felt quite of a hack.
We then switched and added gdal_config
options in the factory (ref #170). I though this worked ... but I was wrong.
In FastAPI, when you define a function with a simple def myfunction(...
, fastapi will use starlette's run_in_threadpool
to run the function (so the API isn't blocked). This makes the function to be run in a thread ... which isn't the MainThread, and this is important. In Rasterio, when using with rasterio.Env(
block, rasterio will check if we are running it in the MainThread or not (https://github.com/mapbox/rasterio/blob/master/rasterio/_env.pyx#L158-L161) and then use different function to set the GDAL config.
The problem here seems that because we use simple def
we run the endpoint function
in a sub Thread (ThreadPoolExecutor-X_X
) and then the env cannot be forwarded to threads within the threads... (yes threads within threads seems bad anyway).
It's kinda hard for me to explain so here is an example
from concurrent import futures
import rasterio
from rasterio._env import get_gdal_config
import threading
from starlette.concurrency import run_in_threadpool
def f(r=None):
return get_gdal_config("GDAL_DISABLE_READDIR_ON_OPEN"), str(threading.current_thread())
print()
print("1 - simple - OK")
with rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN="FALSE"):
print(f())
print()
print("2 - async simple - OK")
with rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN="FALSE"):
print(await run_in_threadpool(f))
def g():
print("Where am I: " + str(threading.current_thread())) # print what thread is used when calling rasterio.Env
with rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN="FALSE"):
with futures.ThreadPoolExecutor() as executor:
return list(executor.map(f, range(1)))[0]
print()
print("3 - simple multi threads - OK")
print(g())
print()
print("4 - async multithread - NOK")
print(await run_in_threadpool(g))
1 - simple - OK
('FALSE', '<_MainThread(MainThread, started 4487112128)>')
2 - async simple - OK
('FALSE', '<Thread(ThreadPoolExecutor-0_13, started daemon 123145455685632)>')
3 - simple multi threads - OK
Where am I: <_MainThread(MainThread, started 4487112128)>
('FALSE', '<Thread(ThreadPoolExecutor-23_0, started daemon 123145492471808)>')
4 - async multithread - NOK
Where am I: <Thread(ThreadPoolExecutor-0_4, started daemon 123145408389120)>
('EMPTY_DIR', '<Thread(ThreadPoolExecutor-24_0, started daemon 123145492471808)>')
☝️ where we use run_in_threadpool
we simulate the actual setting in titiler (def
), fastAPI doesn't use run_in_threadpool
when using async def
.
Fix ?
- use
async def
for functions definition
If you are using a third party library that communicates with something (a database, an API, the file system, etc) and doesn't have support for using await, (this is currently the case for most database libraries), then declare your path operation functions as normally, with just def, like:
FastAPI docs seems to say we shouldn't (and I think @geospatial-jeff told the same in the past)
- see if we can change rasterio (🙅)
Not an expert here, but I guess there is a good reason to use CPLSetConfigOption
only in MainThread (cc @sgillies, sorry for the ping but it's only a FYI)
ref: rasterio/rasterio#1012 & rasterio/rasterio#997
- change rio-tiler reader to forward gdal config 🤷♂️
Seems the safest choice 😭