Broken Postgres clients are released back into the pool, but should be removed #5112
Description
Issue type:
[x] bug report
Database system/driver:
[x] postgres
It may impact other drivers if they have similar semantics/expectations as pg.Pool
.
TypeORM version:
[x] latest
Explanation of the problem
Currently in TypeORM, if a client suffers an unrecoverable error - for example, if the underlying connection goes away, such as during a DB failover - there is no protection in place to stop that client being added back into the pg.Pool
. This broken client will then be handed out in future even though it'll never be able to execute a successful query.
Although pg.Pool
itself does listen for error events on clients within the pool - and actively removes any which do emit errors - it doesn't catch everything. It is considered the responsibility of the user of the pool to release known broken clients by calling client.release(true)
. The truthy argument tells the pool to destroy the connection instead of adding it back to the pool
https://node-postgres.com/api/pool#releaseCallback
The releaseCallback releases an acquired client back to the pool. If you pass a truthy value in the err position to the callback, instead of releasing the client to the pool, the pool will be instructed to disconnect and destroy this client, leaving a space within itself for a new client.
If a client's connection does break, it's very difficult to debug, as a handful of queries will begin failing with an error like Error: Client has encountered a connection error and is not queryable
while others will continue executing fine.
There is further discussion about the impact of this in the node-postgres
repo: brianc/node-postgres#1942
Steps to reproduce or a small repository showing the problem
- Set pool size to 1
- Run a small loop which repeatedly performs a query, e.g.
while (true) {
try {
await getManager().query('select pg_sleep(4)')
console.log('success')
} catch (err) {
console.log('error', err)
}
console.log('done')
await sleep(1000)
}
- Wait for the query to run at least once. This should allocate a connection from the pool. Subsequent invocations of the query should use the same connection.
- Kill the connection while query is running. On a Mac or Linux, this is easily done by finding and killing the process, e.g.
ps aux | grep postgres | grep SELECT
- The loop continue to throw errors, as the broken connection has been returned to the pool and continues to be handed out. This shouldn't happen: we should have told
pg.Pool
the connection is broken so it can be removed from the pool, so the next query should get a new, unbroken, connection
I believe the reason this hasn't been noticed before (at least not that I could see) is because it's really only likely to happen if the actual database connection breaks. The majority of QueryFailedErrors
are caused by dodgy SQL etc, none of which will render a client unusable. And, usually, if your database is killing connections, you've got other problems to think about 😅
We only noticed it because we run PgBouncer in between TypeORM and our Postgres server. When we redeployed PgBouncer, it would kill some of the active client connections in the pool, but because pg.Pool
never found out about it, those connections remained in the pool indefinitely, causing a steady stream of errors even though everything else was fine.
Fix
I have a working fix here: loyaltylion@6bd52e0
If this fix looks suitable, I'd be happy to create a PR to get this merged. It only applies to postgres, but could be extended to other drivers if we think they'd benefit.