
fix: try re-execute statement if backend responds with "prepared statement is no longer valid" #451

Merged: 3 commits merged into master from reexecute_on_invalid_cached_statement, Aug 19, 2016

Conversation

@vlsi (Member) commented Dec 9, 2015

Bug report: http://stackoverflow.com/questions/34180932/error-cached-plan-must-not-change-result-type-when-mixing-ddl-with-select-via
When using "deallocate all" or "alter table", cached prepared statements might become invalid, so we need to re-execute the query in those cases.

It might require some logging; however, I'm not sure what the best way to log that is.

Summary (technical details):

  1. By default pgjdbc watches for DEALLOCATE ALL and DISCARD ALL statements, and invalidates the prepared statement cache if those are detected.
    That does not mean "deallocate all" should be used often. It just means deallocate all might be suitable for maintenance-like procedures.
    Note: deallocate all invalidates only a single connection.
  2. If either "prepared statement XXX does not exist" or "cached statement must not change result type" is observed, pgjdbc invalidates the prepared statement cache (see deallocateEpoch++;). In general those errors should not happen; if observed, it means something has changed behind the scenes.
  3. By default pgjdbc would retry prepared statement XXX does not exist and cached statement must not change result type errors in autocommit=true (non-transactional) mode.
  4. New connection property: autosave
  • conservative -- rollback to the savepoint only in case of "prepared statement does not exist" and
    "cached plan must not change result type". The driver then re-executes the statement and it passes through.
  • never (default) -- never set an automatic savepoint. Note: in this mode statements might still fail with "cached plan must not change result type"
    in autoCommit=false mode
  • always -- always rollback to "before statement execution" state in case of failure. This mode prevents "current transaction aborted" errors.
    It is similar to psql's ON_ERROR_ROLLBACK.

For autocommit=true the savepoint is not used.
In conservative mode, a savepoint is used only for queries that return something (see org.postgresql.core.v3.QueryExecutorImpl#sendAutomaticSavepoint)
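
As an illustration, here is a minimal sketch of enabling the new property from plain JDBC (host, database, and credentials are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class AutosaveExample {
      public static void main(String[] args) throws Exception {
        // autosave=conservative: the driver sets a savepoint only where a
        // "cached plan must not change result type" / "prepared statement
        // does not exist" error could occur, and re-executes on failure.
        String url = "jdbc:postgresql://localhost:5432/test?autosave=conservative";
        try (Connection con = DriverManager.getConnection(url, "test", "test")) {
          con.setAutoCommit(false); // savepoints only matter in transactional mode
          // ... execute statements; recoverable errors are retried transparently
          con.commit();
        }
      }
    }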

@vlsi force-pushed the reexecute_on_invalid_cached_statement branch from 2e9ce80 to d6e0540 on December 9, 2015 20:43
@lukaseder commented

Thanks @vlsi for submitting this PR. I reported the original issue on Stack Overflow as a means to document the current behaviour. When googling, I found a few reports of this type, none of which had been resolved or understood.

This all has to do with the JDBC driver defaulting to a "prepare threshold" of 5, which is why the sixth DDL statement seems to cause cursors to expire.
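
To make the threshold concrete, here is a minimal sketch (connection details are placeholders) of reading and lowering it via pgjdbc's PGConnection API:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.postgresql.PGConnection;

    public class PrepareThresholdSketch {
      public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/test", "test", "test");
        PGConnection pg = con.unwrap(PGConnection.class);
        // Default is 5: a query is promoted to a named server-side prepared
        // statement only once it has been executed that many times, which is
        // why the problem tends to surface on the sixth execution.
        System.out.println("prepareThreshold = " + pg.getPrepareThreshold());
        pg.setPrepareThreshold(1); // prepare server-side from the first execution
        con.close();
      }
    }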

I believe that this kind of issue shouldn't have to be dealt with in the client application: since the client closes each prepared statement, it is no longer the client's responsibility to handle these stale resources.

@alexdupre commented

+1

It's very useful when you have live database migrations and multiple nodes connecting to the same database that cannot all be upgraded/restarted at the same time. Being able to use prepared statements while avoiding downtime is an important feature.

I'm not sure that parsing the textual error message is the best way to do it; aren't there specific error codes identifying such errors?

@vladimirfx commented

We are heavily affected by this issue. We have tried to solve it at the connection pool level (Hikari) by closing the poisoned connection, but fixing it at the driver level seems the more proper way.
However, the proposed fix does not work for cases with a non-English locale on the PG server, as in our case.

@vlsi (Member, Author) commented Aug 8, 2016

@vladimirfx Can you please clarify which error you are running into? "Prepared statement does not exist" or "cached statement cannot change result type"?
The first one has its own error code, but the latter uses just the generic "not implemented" error code.
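
As a sketch of what application code could do in the meantime, keying on SQLState rather than on localized message text (the helper is hypothetical; it assumes 26000 = invalid_sql_statement_name for the former and 0A000 = feature_not_supported for the latter):

    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class StaleStatementRetry {
      // Hypothetical helper: retry once when the backend reports a stale
      // server-side prepared statement, detected by SQLState instead of by
      // the (possibly localized) error message.
      static ResultSet executeWithRetry(PreparedStatement ps) throws SQLException {
        try {
          return ps.executeQuery();
        } catch (SQLException e) {
          String state = e.getSQLState();
          if ("26000".equals(state)       // prepared statement ... does not exist
              || "0A000".equals(state)) { // cached plan must not change result type
            return ps.executeQuery();     // only safe in autocommit (non-transactional) mode
          }
          throw e;
        }
      }
    }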

@vlsi force-pushed the reexecute_on_invalid_cached_statement branch from 2609822 to 569f8b8 on August 8, 2016 19:29
@vlsi added this to the 9.4.1210 milestone on Aug 8, 2016
@vlsi force-pushed the reexecute_on_invalid_cached_statement branch 2 times, most recently from 0403b5b to b6e901d on August 8, 2016 20:59
@vlsi changed the title from "fix: try reexecute statement if backend responds with "prepared statement is no longer valid"" to "fix: try re-execute statement if backend responds with "prepared statement is no longer valid"" on Aug 8, 2016
@codecov-io commented Aug 8, 2016

Current coverage is 61.87% (diff: 83.97%)

Merging #451 into master will increase coverage by 0.23%

@@             master       #451   diff @@
==========================================
  Files           148        150     +2   
  Lines         15075      15151    +76   
  Methods           0          0          
  Messages          0          0          
  Branches       3023       3045    +22   
==========================================
+ Hits           9291       9374    +83   
+ Misses         4537       4525    -12   
- Partials       1247       1252     +5   

Powered by Codecov. Last update 0452e79...f77a3a5

@vlsi (Member, Author) commented Aug 8, 2016

If there are no objections, I will merge this in.

@vladimirfx commented

We are hit by both, but most of the time it's "cached statement cannot change result type". My comment was about the first patch, where only the error text was analyzed. The current implementation seems more robust.

@vlsi (Member, Author) commented Aug 10, 2016

Can you clarify what kind of setup you are using? I'm inclined to log warnings in case the above conditions are met.
Is the application using "set search_path"? Is it using "deallocate all"?

@vladimirfx commented

PostgreSQL 9.3.13, locale ru_RU.UTF-8, pgjdbc 9.4.1208, JDK 1.8.
Two middleware OSGi containers with a couple of modules each. When applying a rolling upgrade with minor (backward-compatible) schema changes, modules that have not yet been upgraded stop working with errors like "cached plan must not change result type" (in Russian: "в кешированном плане не должен изменяться тип результата").
When such a situation is detected, the DBA executes DEALLOCATE ALL, but without success, because a new type of error arises: "prepared statement ... does not exist" (Russian: "подготовленный оператор не существует").
We already do connection invalidation at the pool level, but it is not a robust solution, because it depends on a localized string, and it is not optimal, because transactions keep failing until all connections in the pool have been invalidated.
We are waiting for a release with this pull request merged. Thanks!

@vlsi (Member, Author) commented Aug 12, 2016

Ok, cool.

Some more work is required (see #477) to support autoCommit(false) mode as well.

@vlsi force-pushed the reexecute_on_invalid_cached_statement branch 2 times, most recently from c013109 to e259ce2 on August 13, 2016 16:06
@vlsi (Member, Author) commented Jan 31, 2018

> I think conservative will work well for me because I use JOOQ and all metadata stuff is done at compile-time.

That is interesting. Do you alter column types? Do you just add/remove columns?
Is it select * that fails? I just wonder if JOOQ issues select * kind of queries instead of select column1, column2, ...

> I wonder if the ResultSetMetaData traces might be why IntelliJ IDEA seems to not update its view of my schema, even when I disconnect/reconnect using the GUI? I'll try putting "always" in the URL to see if it helps.

Type cache is a side issue; please file a case for that. I'm not sure it is a pgjdbc issue, but let's not discuss it here.
You might be facing #551, depending on the pgjdbc version.

> It seems a shame to put all that effort into transactional DDL and then turn around and tell people not to use it.

That's a fair point. It might be something to discuss with backend developers though.

@shorn commented Jan 31, 2018

> Do you alter column types? Do you just add/remove columns?

Generally, I would avoid altering a column type - complicated rename / column changes would be done over a series of steps (add / migrate / drop).
But in this specific instance, I was altering the column type as a shortcut to adding a new enum value - I didn't want to do the whole add/migrate/remove dance over multiple schema evolutions just to add an enum value.

You can see part of the upgrade script that changed the app_user table in this Stack Overflow answer: https://stackoverflow.com/a/48498194/924597

The SQL statement JOOQ is issuing that was causing the PG error is:

select 
  "api_svc"."app_user"."email", 
  "api_svc"."app_user"."app_user_account_id", 
  ..., 
  "api_svc"."app_user"."features" 
from "api_svc"."app_user" 
where "api_svc"."app_user"."email" = 'test@example.com'

Some columns elided, but you get the idea - a wildcard column specifier was definitely not involved.

I've hit this error message a few times before, but couldn't reliably reproduce it until stumbling across this issue and starting to get a clue about what was going on.
I had to do things like reduce the pool size to 1 and set prepareThreshold to 1. Plus I got lucky in that the app_user table gets hit straight away on login.


> Type cache is a side issue

Yeah, for sure. Not that important anyway, just a minor development-time annoyance.


> might be something to discuss with backend developers though.

Where does that kind of issue get raised?
There's probably a million reasons why it wouldn't work or would be a bad idea, but I was originally thinking I needed to find some kind of "deallocate all" statement that worked globally, instead of per-connection. The upgrade script would just issue it as the last thing it did. Of course, while writing that down I thought of at least three problems it would need to solve, but still.

@davecramer (Member) commented Jan 31, 2018 via email

@mj1618 commented Mar 3, 2018

Performance appears to be unacceptable for us.

We are also trying to make schema changes to a live, medium-throughput system and running into issues with every solution we look at.

We ran basic load tests with autosave=conservative. The average query time remains pretty much the same. However the worst queries get approx 15x worse. We are running at about 100-130 requests per second through our application.

It sounds like another solution is to run DISCARD ALL after our schema change. I worry that we would be dropping users in the time between making the schema change and running the discard...

@shorn commented Mar 4, 2018

> The average query time remains pretty much the same. However the worst queries get approx 15x worse.

What are the approximate numbers here? Are you talking about an increase of 2 ms to 30 ms? Or an increase of 2 sec to 30 sec?

@mj1618 commented Mar 4, 2018

Yeah, so here are our numbers; I realise how arbitrary they are without the code we used to generate them.

Without autosave conservative:
Avg: 600ms
Min: 50ms
Max: 1800ms
Rps: 130

With autosave conservative:
Avg: 650ms
Min: 35ms
Max: 17100ms
Rps: 120

In this one the worst case was 9.5x worse.

The graphs of these over time looked similar, but the autosave one was bumpy, i.e. at some points perf was impacted a lot, but otherwise it was quite similar.

These weren't our production soak/stress tests, nor were they run for an extended period, but they were enough to turn us off this solution to the cached-query error.

@shorn commented Mar 4, 2018

Holy cow ... 15 seconds added to the worst case... o_O
If your testing is sound, that is.... not great.
Were you actually making changes to the schema during this test? Or is this really just a basic load test with "autosave=conservative"?

@vlsi (Member, Author) commented Mar 4, 2018

> I realise how arbitrary they are without the code we used to generate them

The more details you provide, the better the odds that we can help.

Do you actually make schema changes during the test? If you do, then regular mode should result in an error.

Do you know which SQL is impacted? Is it something related to a single SQL statement having multiple execution plans?

@mj1618 commented Mar 4, 2018

We aren’t running schema changes during the test.

I’ll try to come up with some code to reproduce the issue that we can share.

@orionrenegade commented Mar 5, 2018

> In this one the worst case was 9.5x worse.

When I first asked on this thread (as 'laglace') for troubleshooting related to the autosave feature, I also noticed a significant performance hit.

The application I was working on involved a lot of legacy code and several anti-patterns: one of those involved huge transactions (we're talking about something ranging from twenty minutes to 1 hour in the best-case scenario), with several thousand different queries executed throughout the process.

PostgreSQL managed to handle the first 5 minutes of the transaction without an issue, but performance would degrade rapidly after that mark, to the point where simple SELECT queries would start to take 30 seconds. As the transaction continued, these values would increase linearly, and after 10 hours each single query to the database would be taking about 15 minutes.

With autosave disabled, performance was MUCH smoother, but that route would have involved refactoring large chunks of code, so I never got around to it, as management was not too pleased with the idea of delays. As I no longer work for that company, I can't provide any queries or EXPLAINs, but I think this post pretty much covers my findings related to this issue.

@davecramer (Member) commented

Well, there's no question that there will be a performance impact from this feature. The whole reason this feature exists is to try to mimic the behaviour of other databases: PostgreSQL does not normally allow partial commits to persist, while others do. The feature does this by inserting a savepoint between statements. Clearly this has performance ramifications; I don't see any way to fix this.
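
Roughly what the feature does per statement, as a hand-rolled sketch rather than the actual driver code:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Savepoint;
    import java.sql.Statement;

    public class AutosaveSketch {
      // A savepoint before each statement lets the driver roll back just the
      // failed statement instead of aborting the whole transaction; the
      // per-statement savepoint bookkeeping is where the overhead comes from.
      static void executeWithSavepoint(Connection con, String sql) throws SQLException {
        Savepoint sp = con.setSavepoint("autosave_sketch");
        try (Statement st = con.createStatement()) {
          st.execute(sql);
          con.releaseSavepoint(sp);
        } catch (SQLException e) {
          con.rollback(sp); // the transaction remains usable after the failure
          throw e;
        }
      }
    }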

@mj1618 commented Mar 6, 2018

Yeah, I understand and agree.

We will just have to find a different way to overcome this error. Maybe we're missing something; it just seems this should be an easy problem to solve, to allow non-disruptive schema changes to live tables.

@davecramer (Member) commented Mar 6, 2018 via email

@mj1618 commented Mar 6, 2018

Thanks for the tip!

@vlsi (Member, Author) commented Mar 6, 2018

@orionrenegade, do you know which SQL fails?
Is it select * or select explicit_column_list?

@mj1618 commented Mar 6, 2018

I tested “select explicit_column_list” which does fail even when only adding a column to the table.

@davecramer (Member) commented Mar 6, 2018 via email

@vlsi (Member, Author) commented Mar 6, 2018

> I tested “select explicit_column_list” which does fail even when only adding a column to the table.

I wonder if that can be handled on the backend side. I had thought an explicit column list should be OK.

@davecramer (Member) commented Mar 6, 2018 via email

@vlsi (Member, Author) commented Mar 6, 2018

"cached plan must not change result type" does compare tuple descriptor, so I wonder what is the error message for the case when a column is added.

https://github.com/postgres/postgres/blob/9d4649ca49416111aee2c84b7e4441a0b7aa2fac/src/backend/utils/cache/plancache.c#L715-L722

@davecramer (Member) commented

@mj1618 can you get the error message from the backend?

@mj1618 commented Mar 6, 2018

Yes, this is the exact error we get.

@orionrenegade commented

> @orionrenegade, do you know which SQL fails?
> Is it select * or select explicit_column_list?

Like I said, I don't have access to the queries any more, but both cases were problematic. (In fact, any query would be problematic under those stress conditions, even against tables with fewer than 100 rows.)

@vlsi (Member, Author) commented Mar 6, 2018

@mj1618, I've just run

    for (int i = 0; i < 10; i++) {
      PreparedStatement ps = con.prepareStatement("select i, t from test_statement");
      ResultSet rs = ps.executeQuery();
      while (rs.next()) {
        System.out.println("rs.getString(1) = " + rs.getString(1));
      }
      rs.close();
      ps.close();
      // DDL between executions: add a new column each iteration
      con.createStatement().execute("alter table test_statement add column c" + i + " int4");
    }

and it works with no problems with PostgreSQL 9.6

That is, there should be something else besides "a mere column addition" that triggers the issue.

@mj1618 commented Mar 6, 2018

Ok, I see your point. We were using Flyway to do the schema change; I wonder if it is executing something that causes the issue.

@mj1618 commented Mar 6, 2018

After starting the ALTER TABLE in Flyway, we ran a SELECT from another process; this is what caused the issue.

I will try to write a small PoC to demonstrate.

@mj1618 commented Mar 6, 2018

No, you are right, my apologies: only select * is affected.

Here is what I used:

        // prepareThreshold=1 forces a server-side prepared statement from the first execution
        String url = "jdbc:postgresql://localhost:5432/...?user=...&password=...&prepareThreshold=1";
        Connection con = DriverManager.getConnection(url);
        con.setAutoCommit(false);
        con.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
        while (true) {
            PreparedStatement ps = con.prepareStatement("select id, user_id from \"user\"");
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println("rs.getString(1) = " + rs.getString(1));
            }
            rs.close();
            ps.close();
            con.commit();
        }

With ALTER TABLE "user" ADD COLUMN mycol TEXT null; SELECT pg_sleep(5); running via Flyway.

And it worked fine. Changing the column list to * did produce the error.

I must have got it wrong: our primary SQL obfuscation tool, Hibernate, must be doing something different from what we assumed.
