Transaction: on retry, replays should compare checksums of prior numbered statements that succeeded #34
Description
Coming here from a project that plans on adding Cloud Spanner as a backend for Django.
In AUTOCOMMIT=off mode, we need to hold a Transaction for perhaps an indefinitely long time.
Cloud Spanner will abort:
a) Transactions when not used for 10seconds or more -- we can periodically send a SELECT 1=1 to keep it active
b) Transactions even when refreshed, can and will abort. This is because Cloud Spanner has a high abort rate
Thus we need to retry Transactions!
Current retry
The current code for retrying in this repository is to just re-invoke the function that was passed into *.run_in_transaction afresh with a new Transaction per
python-spanner/google/cloud/spanner_v1/session.py
Lines 290 to 321 in 23916c5
Recommended retry
However, the correct way to retry Transactions as @bvandiver explained to me
You are getting quite close to the implementation in the open source JDBC driver. Rather than re-inventing things, I would suggest following their implementation. Of note, your current replay mechanism can lead to wrong answers. Imagine the canonical "transfer balance" transaction which decrements the balance in acct A, then increases the balance in acct B. However, between abort and retry someone deletes acct A - resulting in money magically appearing in acct B and no error (the update silently fails to update any rows). The long and the short of it is that you need to hash the results of all queries + DML and confirm on your retry that they give the same answers. You need query too (think a query to check if there was sufficient balance in acct A).
a) For every result returned by an operation on a Transaction, compute the checksum and add it a FIFO stack
b) At the point that a prior Transaction fails, that's the bottom of our stack
c) When retrying the Transaction from the first statement, compare its checksum with the same ordinal number/index on the FIFO stack -- if any of them don't match, abort the Transaction as not retryable
This is what the Java spanner-jdbc implementation does
Suggestion
The implementation of this feature when attempted outside of this package involves a whole lot of hacking since we need to consume the raw data sent to StreamedResultSets which requires then proto marshalling and wrapping StreamedResult -- quite non-ideal and will actually involve patches to python-spanner.
@bvandiver and I chatted again about this today and I also briefly raised this issue to @skuruppu this afternoon too.