Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support aborted transactions internal retry #544

Closed

Conversation

IlyaFaer
Copy link
Contributor

@IlyaFaer IlyaFaer commented Oct 23, 2020

Implement aborted transactions retry mechanism.

While executing SQL statements in !autocommit mode, connection must remember every executed statement. In case the transaction aborted, all of these statements should be re-executed. Doing this, connection also must calculate checksum of every statement results, so that we could check if the retried transaction got the same results that the original one got. In case the checksums are not equal there is no way to continue transaction due to underlying data being changed during retry.

Closes #539

@IlyaFaer IlyaFaer added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. api: spanner Issues related to the googleapis/python-spanner-django API. labels Oct 23, 2020
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Oct 23, 2020
google/cloud/spanner_dbapi/checksum.py Outdated Show resolved Hide resolved
google/cloud/spanner_dbapi/connection.py Show resolved Hide resolved
google/cloud/spanner_dbapi/connection.py Outdated Show resolved Hide resolved
google/cloud/spanner_dbapi/cursor.py Show resolved Hide resolved
google/cloud/spanner_dbapi/cursor.py Show resolved Hide resolved
@IlyaFaer
Copy link
Contributor Author

IlyaFaer commented Oct 23, 2020

@c24t, @olavloite, I've pushed a part of transaction retry mechanism implementation, PTAL.

google/cloud/spanner_dbapi/cursor.py Show resolved Hide resolved
google/cloud/spanner_dbapi/checksum.py Outdated Show resolved Hide resolved
google/cloud/spanner_dbapi/checksum.py Outdated Show resolved Hide resolved
google/cloud/spanner_dbapi/cursor.py Show resolved Hide resolved
google/cloud/spanner_dbapi/cursor.py Outdated Show resolved Hide resolved
google/cloud/spanner_dbapi/cursor.py Show resolved Hide resolved
tests/spanner_dbapi/test_checksum.py Outdated Show resolved Hide resolved
google/cloud/spanner_dbapi/checksum.py Outdated Show resolved Hide resolved
tests/spanner_dbapi/test_checksum.py Outdated Show resolved Hide resolved
@IlyaFaer
Copy link
Contributor Author

IlyaFaer commented Oct 26, 2020

Pushed the second step of the implementation. Added actual retrying of aborted transaction. Requires to add more unit tests yet, for Cursor - to check if fetch* methods are retrying transactions correctly.


client_mock.assert_called_once_with(
project=PROJECT,
credentials=CREDENTIALS,
client_info=CLIENT_INFO,
Copy link
Contributor Author

@IlyaFaer IlyaFaer Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is actually broken by another PR, but nox doesn't see it and doesn't run it. These tests were not moved to unit directory.

from google.cloud.spanner_dbapi import Connection, InterfaceError
from google.cloud.spanner_dbapi.checksum import ResultsChecksum
from google.cloud.spanner_dbapi.connection import AUTOCOMMIT_MODE_WARNING
from google.cloud.spanner_v1.database import Database
from google.cloud.spanner_v1.instance import Instance
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why these tests are still here? They were copied into unit directory, so I suppose they should be erased from this directory?!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Looks like this (and test_connect) weren't moved in #532?

https://github.com/q-logic/python-spanner-django/blob/41abaebb6f2e0b1cf16704aa1e394acc5a47e68b/tests/spanner_dbapi/test_connection.py

The test files weren't exactly copied, #532 changed them and added some new tests. E.g. the version on master now doesn't include test_transaction_autocommit_warnings.

@mf2199 can you confirm that you meant to change/remove these tests before removing tests/spanner_dbapi in this PR?

@IlyaFaer
Copy link
Contributor Author

@olavloite, am I understood correctly, that commit() can also be aborted? And in this case we should retry the whole transaction?

@olavloite
Copy link
Contributor

@olavloite, am I understood correctly, that commit() can also be aborted? And in this case we should retry the whole transaction?

@IlyaFaer Yes, that is correct.

Copy link
Contributor

@c24t c24t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides cleaning up the tests, I wonder if this PR is catching Aborted errors in the right place. Since we're only calling execute_sql and not the streaming variant, I'd expect to see errors immediately instead of when we iterate over the results in the cursor. Am I missing something here?

tests/unit/spanner_dbapi/test_cursor.py Show resolved Hide resolved
google/cloud/spanner_dbapi/cursor.py Outdated Show resolved Hide resolved
google/cloud/spanner_dbapi/connection.py Outdated Show resolved Hide resolved
google/cloud/spanner_dbapi/connection.py Show resolved Hide resolved
return
except Aborted:
self.connection.retry_transaction()
return self.fetchone()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be a problem. Assuming that this is using the ExecuteStreamingSql RPC, then each next() call could potentially mean that a new RPC is executed. So for the sake of simplicity, assume in the example below that each call to next() executes the ExecuteStreamingSql RPC.

Assume the following situation:

  1. The table Singers contains the following singers (last names): Allison, Morrison, Pieterson
  2. The application executes the query SELECT LastName FROM Singers ORDER BY LastName in transaction 1.
  3. The client application calls fetchone() which returns 'Allison'.
  4. Some other transaction executes `DELETE FROM Singers WHERE LastName='Pieterson'.
  5. The first transaction is aborted by the backend. A retry is executed and the retry logic checks that the checksum of the retried result set is equal to the original attempt, which it is as the first record is still 'Allison'.
  6. The client application calls fetchone() again. This should return 'Morrison', but as it needs to call ExecuteStreamingSql it will (probably) use the transaction id of the original transaction (unless that transaction id has somehow been replaced in the underlying iterator). If it does use the old transaction id, the RPC will fail with yet another Aborted error, and that will repeat itself until the transaction retry limit has been reached.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to me we can just drop the _transaction property, so that Connection will initiate a new one on the next execute() call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@IlyaFaer @c24t

Sorry for reopening this, and this comment should not be considered blocking for merging this PR, but I think we need to look into this once more. Only dropping the _transaction property will in this case not be enough for the following reason:

  1. When executeSql is called, a streaming iterator is returned to the application.
  2. That streaming iterator is linked with the transaction that was active at that moment, and a reference to that transaction is also held in the iterator.
  3. If a transaction is aborted and the client application has consumed only parts of a streaming iterator, that iterator is no longer valid (at least: it will also throw an exception if it needs to receive more data from the server).

The JDBC driver client solves the above problem by wrapping all streaming iterators before returning these to the client application. That makes it possible for the JDBC driver to replace the underlying streaming iterator with a new one when a transaction has been aborted and successfully retried.

We should add that to the Python DBApi as well, but we could do that in a separate PR to prevent this PR from becoming even bigger than it already is.

Copy link
Contributor Author

@IlyaFaer IlyaFaer Nov 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@c24t, @olavloite, hm-m. I think we're protected from errors here, because our connection API doesn't actually give streaming result objects to a user.

Here is where we're getting a streaming iterator:

self._result_set = transaction.execute_sql(
sql, params, param_types=get_param_types(params)
)
self._itr = PeekIterator(self._result_set)

So, iterator is held in the protected property _itr, and users will be streaming it with Cursor.fetch*() methods, without actual access to the iterator itself:

def fetchone(self):
"""Fetch the next row of a query result set, returning a single
sequence, or None when no more data is available."""
self._raise_if_closed()
try:
return next(self)
except StopIteration:
return None

Where next(self) is calling next(self._itr) here:

def __next__(self):
if self._itr is None:
raise ProgrammingError("no results to return")
return next(self._itr)

Thus, if a transaction failed, the connection will drop the transaction, checkout a new one, re-run all the statements, each of which will replace _itr with a new streamed iterator. So, all the iterators are processed internally, and will be replaced on a retry, as I see.

@IlyaFaer IlyaFaer marked this pull request as ready for review November 4, 2020 07:25
@IlyaFaer IlyaFaer requested a review from a team as a code owner November 4, 2020 07:25
Copy link
Contributor

@olavloite olavloite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. I think this is coming very to close to what we need, but I have a couple of small questions.

google/cloud/spanner_dbapi/exceptions.py Outdated Show resolved Hide resolved
google/cloud/spanner_dbapi/connection.py Outdated Show resolved Hide resolved
google/cloud/spanner_dbapi/connection.py Show resolved Hide resolved
Copy link
Contributor

@c24t c24t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more substantial comments from me, but note that this whole PR will have to move to python-spanner now that googleapis/python-spanner#160 is in.

@olavloite, does this address all your comments?

@olavloite
Copy link
Contributor

olavloite commented Nov 12, 2020

No more substantial comments from me, but note that this whole PR will have to move to python-spanner now that googleapis/python-spanner#160 is in.

@olavloite, does this address all your comments?

Yes, my comments have been addressed. I do think that it's important that we get the (integration) tests running soon, as there could still be corner cases that we haven't thought of yet that only occur incidentally.

I also have a separate concern regarding streaming iterators, but that should not block the merging of this PR.

c24t
c24t approved these changes Nov 17, 2020
@c24t
Copy link
Contributor

c24t commented Nov 17, 2020

This PR won't be merged here, see googleapis/python-spanner#168.

c24t pushed a commit to googleapis/python-spanner that referenced this pull request Nov 23, 2020
@IlyaFaer
Copy link
Contributor Author

Merged into the original Spanner client repo

@IlyaFaer IlyaFaer closed this Nov 30, 2020
@IlyaFaer IlyaFaer deleted the aborted_transactions_retry branch November 30, 2020 07:03
gcf-merge-on-green bot pushed a commit to googleapis/python-spanner that referenced this pull request Nov 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the googleapis/python-spanner-django API. cla: yes This human has signed the Contributor License Agreement. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Transaction retry
3 participants