
fix: try re-execute statement if backend responds with "prepared statement is no longer valid" #451

Merged: 3 commits merged into master from reexecute_on_invalid_cached_statement, Aug 19, 2016

Conversation

@vlsi (Member) commented Dec 9, 2015

Bug report: http://stackoverflow.com/questions/34180932/error-cached-plan-must-not-change-result-type-when-mixing-ddl-with-select-via
When using "deallocate all" or "alter table", cached prepared statements might become invalid, so we need to re-execute the query in those cases.

It might require some logging; however, I'm not sure what the best way to log that is.

Summary (technical details):

  1. By default pgjdbc watches for DEALLOCATE ALL and DISCARD ALL statements, and invalidates the prepared statement cache if those are detected.
    That does not mean "deallocate all" should be used often. It just means deallocate all might be suitable for maintenance-like procedures.
    Note: deallocate all invalidates only a single connection.
  2. If either "prepared statement XXX does not exist" or "cached statement must not change result type" is observed, pgjdbc invalidates the prepared statement cache (see deallocateEpoch++;). In general those errors should not happen; if observed, it means something has changed behind the scenes.
  3. By default pgjdbc would retry prepared statement XXX does not exist and cached statement must not change result type errors in autocommit=true (non-transactional) mode.
  4. New connection property: autosave
  • conservative -- rollback to the savepoint only in case of "prepared statement does not exist" and
    "cached plan must not change result type". The driver then re-executes the statement and it passes through.
  • never (default) -- never set an automatic savepoint. Note: in this mode statements might still fail with "cached plan must not change result type"
    in autoCommit=false mode
  • always -- always rollback to "before statement execution" state in case of failure. This mode prevents "current transaction aborted" errors.
    It is similar to psql's ON_ERROR_ROLLBACK.

For autocommit=true the savepoint is not used.
In conservative mode, a savepoint is used only for queries that return something (see org.postgresql.core.v3.QueryExecutorImpl#sendAutomaticSavepoint)
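
As an illustration, here is a minimal sketch of enabling the new property from plain JDBC (host, database, and credentials are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class AutosaveExample {
      public static void main(String[] args) throws Exception {
        // autosave=conservative: the driver sets a savepoint only where a
        // "cached plan must not change result type" / "prepared statement
        // does not exist" error could occur, and re-executes on failure.
        String url = "jdbc:postgresql://localhost:5432/test?autosave=conservative";
        try (Connection con = DriverManager.getConnection(url, "test", "test")) {
          con.setAutoCommit(false); // savepoints only matter in transactional mode
          // ... execute statements; recoverable errors are retried transparently
          con.commit();
        }
      }
    }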

@vlsi force-pushed the reexecute_on_invalid_cached_statement branch from 2e9ce80 to d6e0540 on December 9, 2015 20:43
@lukaseder commented

Thanks @vlsi for submitting this PR. I reported the original issue on Stack Overflow as a means to document the current behaviour. When googling, I found a few reports of this type, none of which had been resolved or understood.

This all has to do with the JDBC driver defaulting to a "prepare threshold" of 5, which is why the sixth DDL statement seems to cause cursors to expire.
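
To make the threshold concrete, here is a minimal sketch (connection details are placeholders) of reading and lowering it via pgjdbc's PGConnection API:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.postgresql.PGConnection;

    public class PrepareThresholdSketch {
      public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/test", "test", "test");
        PGConnection pg = con.unwrap(PGConnection.class);
        // Default is 5: a query is promoted to a named server-side prepared
        // statement only once it has been executed that many times, which is
        // why the problem tends to surface on the sixth execution.
        System.out.println("prepareThreshold = " + pg.getPrepareThreshold());
        pg.setPrepareThreshold(1); // prepare server-side from the first execution
        con.close();
      }
    }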

I believe that this kind of issue shouldn't have to be dealt with in the client application: since the client closes each prepared statement, it is no longer the client's responsibility to handle these stale resources.

@alexdupre commented

+1

It's very useful when you have live database migrations and multiple nodes connecting to the same database that cannot all be upgraded/restarted at the same time. Being able to use prepared statements while avoiding downtime is an important feature.

I'm not sure that parsing the textual error message is the best way to do it; aren't there specific error codes identifying such errors?

@vladimirfx commented

We are heavily affected by this issue. We have tried to solve it at the connection pool level (Hikari) by closing the poisoned connection, but fixing it at the driver level seems the more proper way.
However, the proposed fix does not work for cases with a non-English locale on the PG server, as in our case.

@vlsi (Member, Author) commented Aug 8, 2016

@vladimirfx Can you please clarify which error you are running into? "Prepared statement does not exist" or "cached statement cannot change result type"?
The first one has its own error code, but the latter uses just the generic "not implemented" error code.
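
As a sketch of what application code could do in the meantime, keying on SQLState rather than on localized message text (the helper is hypothetical; it assumes 26000 = invalid_sql_statement_name for the former and 0A000 = feature_not_supported for the latter):

    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class StaleStatementRetry {
      // Hypothetical helper: retry once when the backend reports a stale
      // server-side prepared statement, detected by SQLState instead of by
      // the (possibly localized) error message.
      static ResultSet executeWithRetry(PreparedStatement ps) throws SQLException {
        try {
          return ps.executeQuery();
        } catch (SQLException e) {
          String state = e.getSQLState();
          if ("26000".equals(state)       // prepared statement ... does not exist
              || "0A000".equals(state)) { // cached plan must not change result type
            return ps.executeQuery();     // only safe in autocommit (non-transactional) mode
          }
          throw e;
        }
      }
    }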

@vlsi force-pushed the reexecute_on_invalid_cached_statement branch from 2609822 to 569f8b8 on August 8, 2016 19:29
@vlsi added this to the 9.4.1210 milestone on Aug 8, 2016
@vlsi force-pushed the reexecute_on_invalid_cached_statement branch 2 times, most recently from 0403b5b to b6e901d on August 8, 2016 20:59
@vlsi changed the title from "fix: try reexecute statement if backend responds with "prepared statement is no longer valid"" to "fix: try re-execute statement if backend responds with "prepared statement is no longer valid"" on Aug 8, 2016
@codecov-io commented Aug 8, 2016

Current coverage is 61.87% (diff: 83.97%)

Merging #451 into master will increase coverage by 0.23%

@@             master       #451   diff @@
==========================================
  Files           148        150     +2   
  Lines         15075      15151    +76   
  Methods           0          0          
  Messages          0          0          
  Branches       3023       3045    +22   
==========================================
+ Hits           9291       9374    +83   
+ Misses         4537       4525    -12   
- Partials       1247       1252     +5   

Powered by Codecov. Last update 0452e79...f77a3a5

@vlsi (Member, Author) commented Aug 8, 2016

If there are no objections, I will merge this in.

@vladimirfx commented

We are hit by both, but most of the time it's "cached statement cannot change result type". My comment was about the first patch, where only the error text was analyzed. The current implementation seems more robust.

@vlsi (Member, Author) commented Aug 10, 2016

Can you clarify what kind of setup you are using? I'm inclined to log warnings in case the above conditions are met.
Is the application using "set search_path"? Is it using "deallocate all"?

@vladimirfx commented

PostgreSQL 9.3.13, locale ru_RU.UTF-8, pgjdbc 9.4.1208, JDK 1.8.
Two middleware OSGi containers with a couple of modules each. When applying a rolling upgrade with minor (backward-compatible) schema changes, modules that have not yet been upgraded stop working with errors like "cached plan must not change result type" (in Russian: "в кешированном плане не должен изменяться тип результата").
When such a situation is detected, the DBA executes DEALLOCATE ALL, but without success, because a new type of error arises: "prepared statement ... does not exist" (Russian: "подготовленный оператор не существует").
We already do connection invalidation at the pool level, but it is not a robust solution, because it depends on a localized string, and it is not optimal, because transactions keep failing until all connections in the pool have been invalidated.
We are waiting for a release with this pull request merged. Thanks!

@vlsi (Member, Author) commented Aug 12, 2016

Ok, cool.

Some more work is required (see #477) to support autoCommit(false) mode as well.

@vlsi force-pushed the reexecute_on_invalid_cached_statement branch 2 times, most recently from c013109 to e259ce2 on August 13, 2016 16:06
@vlsi (Member, Author) commented Jan 31, 2018

> I think conservative will work well for me because I use JOOQ and all metadata stuff is done at compile-time.

That is interesting. Do you alter column types? Do you just add/remove columns?
Is it select * that fails? I just wonder if JOOQ issues select * kind of queries instead of select column1, column2, ...

> I wonder if the ResultSetMetaData traces might be why IntelliJ IDEA seems to not update its view of my schema, even when I disconnect/reconnect using the GUI? I'll try putting "always" in the URL to see if it helps.

Type cache is a side issue; please file a case for that. I'm not sure it is a pgjdbc issue, but let's not discuss it here.
You might be facing #551, depending on the pgjdbc version.

> It seems a shame to put all that effort into transactional DDL and then turn around and tell people not to use it.

That's a fair point. It might be something to discuss with backend developers though.

@shorn commented Jan 31, 2018

> Do you alter column types? Do you just add/remove columns?

Generally, I would avoid altering a column type - complicated rename / column changes would be done over a series of steps (add / migrate / drop).
But in this specific instance, I was altering the column type as a shortcut to adding a new enum value - I didn't want to do the whole add/migrate/remove dance over multiple schema evolutions just to add an enum value.

You can see part of the upgrade script that changed the app_user table in this Stack Overflow answer: https://stackoverflow.com/a/48498194/924597

The SQL statement JOOQ is issuing that was causing the PG error is:

select 
  "api_svc"."app_user"."email", 
  "api_svc"."app_user"."app_user_account_id", 
  ..., 
  "api_svc"."app_user"."features" 
from "api_svc"."app_user" 
where "api_svc"."app_user"."email" = 'test@example.com'

Some columns elided, but you get the idea - a wildcard column specifier was definitely not involved.

I've hit this error message a few times before, but couldn't reliably reproduce it until stumbling across this issue and starting to get a clue about what was going on.
I had to do things like reduce the pool size to 1 and set prepareThreshold to 1. Plus I got lucky in that the app_user table gets hit straight away on login.


> Type cache is a side issue

Yeah, for sure. Not that important anyway, just a minor development-time annoyance.


> might be something to discuss with backend developers though.

Where does that kind of issue get raised?
There's probably a million reasons why it wouldn't work or would be a bad idea, but I was originally thinking I needed to find some kind of "deallocate all" statement that worked globally, instead of per-connection. The upgrade script would just issue it as the last thing it did. Of course, while writing that down I thought of at least three problems it would need to solve, but still.

@davecramer (Member) commented Jan 31, 2018 via email

@mj1618 commented Mar 3, 2018

Performance appears to be unacceptable for us.

We are also trying to make schema changes to a live, medium-throughput system and running into issues with every solution we look at.

We ran basic load tests with autosave=conservative. The average query time remains pretty much the same. However the worst queries get approx 15x worse. We are running at about 100-130 requests per second through our application.

It sounds like another solution is to run DISCARD ALL after our schema change. I worry that we would be dropping users in the time between making the schema change and running the discard...

@shorn commented Mar 4, 2018

> The average query time remains pretty much the same. However the worst queries get approx 15x worse.

What are the approximate numbers here? Are you talking about an increase of 2 ms to 30 ms? Or an increase of 2 sec to 30 sec?

@mj1618 commented Mar 4, 2018

Yeah, so here are our numbers; I realise how arbitrary they are without the code we used to generate them.

Without autosave conservative:
Avg: 600ms
Min: 50ms
Max: 1800ms
Rps: 130

With autosave conservative:
Avg: 650ms
Min: 35ms
Max: 17100ms
Rps: 120

In this one the worst case was 9.5x worse.

The graphs of these over time looked similar, but the autosave one was bumpy, i.e. at some points perf was impacted a lot, but otherwise it was quite similar.

These weren't our production soak/stress tests, nor were they run for an extended period, but they were enough to turn us off this solution to the cached-query error.

@shorn commented Mar 4, 2018

Holy cow ... 15 seconds added to the worst case... o_O
If your testing is sound, that is.... not great.
Were you actually making changes to the schema during this test? Or is this really just a basic load test with "autosave=conservative"?

@vlsi (Member, Author) commented Mar 4, 2018

> I realise how arbitrary they are without the code we used to generate them

The more details you provide, the better the odds that we can help.

Do you actually make schema changes during the test? If you do, then regular mode should result in an error.

Do you know which SQL is impacted? Is it something related to a single SQL statement having multiple execution plans?

@mj1618 commented Mar 4, 2018

We aren’t running schema changes during the test.

I’ll try to come up with some code to reproduce the issue that we can share.

@orionrenegade commented Mar 5, 2018

> In this one the worst case was 9.5x worse.

When I first asked on this thread (as 'laglace') for troubleshooting related to the autosave feature, I also noticed a significant performance hit.

The application I was working on involved a lot of legacy code and several anti-patterns: one of those involved huge transactions (we're talking about something ranging from twenty minutes to 1 hour in the best-case scenario), with several thousand different queries executed throughout the process.

PostgreSQL managed to handle the first 5 minutes of the transaction without an issue, but performance would degrade rapidly after that mark, to the point where simple SELECT queries would start to take 30 seconds. As the transaction continued, these values would increase linearly, and after 10 hours each single query to the database would be taking about 15 minutes.

With autosave disabled, performance was MUCH smoother, but that route would have involved refactoring large chunks of code, so I never got around to it, as management was not too pleased with the idea of delays. As I no longer work for that company, I can't provide any queries or EXPLAINs, but I think this post pretty much covers my findings related to this issue.

@davecramer (Member) commented

Well, there's no question that there will be a performance impact from this feature. The whole reason this feature exists is to try to mimic the behaviour of other databases: PostgreSQL does not normally allow partial commits to persist, while others do. The feature does this by inserting a savepoint between statements. Clearly this has performance ramifications; I don't see any way to fix this.
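
Roughly what the feature does per statement, as a hand-rolled sketch rather than the actual driver code:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Savepoint;
    import java.sql.Statement;

    public class AutosaveSketch {
      // A savepoint before each statement lets the driver roll back just the
      // failed statement instead of aborting the whole transaction; the
      // per-statement savepoint bookkeeping is where the overhead comes from.
      static void executeWithSavepoint(Connection con, String sql) throws SQLException {
        Savepoint sp = con.setSavepoint("autosave_sketch");
        try (Statement st = con.createStatement()) {
          st.execute(sql);
          con.releaseSavepoint(sp);
        } catch (SQLException e) {
          con.rollback(sp); // the transaction remains usable after the failure
          throw e;
        }
      }
    }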

@mj1618 commented Mar 6, 2018

Yeah, I understand and agree.

We will just have to find a different way to overcome this error. Maybe we're missing something; it just seems this should be an easy problem to solve, to allow non-disruptive schema changes to live tables.

@davecramer (Member) commented Mar 6, 2018 via email

@mj1618 commented Mar 6, 2018

Thanks for the tip!

@vlsi (Member, Author) commented Mar 6, 2018

@orionrenegade, do you know which SQL fails?
Is it select * or select explicit_column_list?

@mj1618 commented Mar 6, 2018

I tested “select explicit_column_list” which does fail even when only adding a column to the table.

@davecramer (Member) commented Mar 6, 2018 via email

@vlsi (Member, Author) commented Mar 6, 2018

> I tested “select explicit_column_list” which does fail even when only adding a column to the table.

I wonder if that can be handled on the backend side. I had thought an explicit column list should be OK.

@davecramer (Member) commented Mar 6, 2018 via email

@vlsi (Member, Author) commented Mar 6, 2018

"cached plan must not change result type" does compare tuple descriptor, so I wonder what is the error message for the case when a column is added.

https://github.com/postgres/postgres/blob/9d4649ca49416111aee2c84b7e4441a0b7aa2fac/src/backend/utils/cache/plancache.c#L715-L722

@davecramer (Member) commented

@mj1618 can you get the error message from the backend?

@mj1618 commented Mar 6, 2018

Yes, this is the exact error we get.

@orionrenegade commented

> @orionrenegade, do you know which SQL fails?
> Is it select * or select explicit_column_list?

Like I said, I don't have access to the queries any more, but both cases were problematic. (In fact, any query would be problematic under those stress conditions, even against tables with fewer than 100 rows.)

@vlsi (Member, Author) commented Mar 6, 2018

@mj1618, I've just run

    for (int i = 0; i < 10; i++) {
      PreparedStatement ps = con.prepareStatement("select i, t from test_statement");
      ResultSet rs = ps.executeQuery();
      while (rs.next()) {
        System.out.println("rs.getString(1) = " + rs.getString(1));
      }
      rs.close();
      ps.close();
      // DDL between executions: add a new column each iteration
      con.createStatement().execute("alter table test_statement add column c" + i + " int4");
    }

and it works with no problems with PostgreSQL 9.6

That is, there should be something else besides "a mere column addition" that triggers the issue.

@mj1618 commented Mar 6, 2018

Ok, I see your point. We were using Flyway to do the schema change; I wonder if it is executing something that causes the issue.

@mj1618 commented Mar 6, 2018

After starting the ALTER TABLE in Flyway, we ran a SELECT from another process; this is what caused the issue.

I will try to write a small PoC to demonstrate.

@mj1618 commented Mar 6, 2018

No, you are right, my apologies: only select * is affected.

Here is what I used:

        // prepareThreshold=1 forces a server-side prepared statement from the first execution
        String url = "jdbc:postgresql://localhost:5432/...?user=...&password=...&prepareThreshold=1";
        Connection con = DriverManager.getConnection(url);
        con.setAutoCommit(false);
        con.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
        while (true) {
            PreparedStatement ps = con.prepareStatement("select id, user_id from \"user\"");
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println("rs.getString(1) = " + rs.getString(1));
            }
            rs.close();
            ps.close();
            con.commit();
        }

With ALTER TABLE "user" ADD COLUMN mycol TEXT null; SELECT pg_sleep(5); running via Flyway.

And it worked fine. Changing the column list to * did produce the error.

I must have got it wrong: our primary SQL obfuscation tool, Hibernate, must be doing something different from what we assumed.
