[bug](schema change)fix schema change cause load failed due to err -215 #23836
base: master
(From new machine)TeamCity pipeline, clickbench performance test result: |
run buildall
I will try to implement this for cloud model in the future.
run p0
run buildall
    try {
        this.deleteTabletWatermarkTxnId =
                Env.getCurrentGlobalTransactionMgr().getNextTransactionId();
    } catch (UserException e) {
        LOG.warn("get next transaction id failed");
    }
This assignment would be better placed in AlterJob.java. Just add a unified method like:
    protected void assignDeleteTabletWatermarkTxnId() {
        try {
            this.deleteTabletWatermarkTxnId =
                    Env.getCurrentGlobalTransactionMgr().getNextTransactionId();
        } catch (UserException e) {
            LOG.warn("get next transaction id failed");
        }
    }
Then we could just call assignDeleteTabletWatermarkTxnId() here and also in SchemaChangeJobV2, and even in CloudSchemaChangeJob and CloudRollupJob in the future.
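To make the suggestion concrete, here is a minimal, self-contained sketch of hoisting the assignment into a shared base class. It is not the Doris code: `SketchUserException`, `TxnIdSource`, `AlterJobSketch`, `RollupJobSketch`, and `SchemaChangeJobSketch` are hypothetical stand-ins for `UserException`, the global transaction manager, `AlterJob`, and its subclasses, and the `-1` "unassigned" default is an assumption.

```java
// Hypothetical stand-in for Doris's UserException (illustration only).
class SketchUserException extends Exception {
    SketchUserException(String msg) {
        super(msg);
    }
}

// Hypothetical stand-in for the global transaction manager.
interface TxnIdSource {
    long getNextTransactionId() throws SketchUserException;
}

// Plays the role of AlterJob: owns the watermark and the one shared helper.
abstract class AlterJobSketch {
    // -1 means "not assigned yet" (an assumption of this sketch).
    protected long deleteTabletWatermarkTxnId = -1;
    private final TxnIdSource txnIdSource;

    protected AlterJobSketch(TxnIdSource txnIdSource) {
        this.txnIdSource = txnIdSource;
    }

    // Single place that assigns the watermark; subclasses only call it.
    protected void assignDeleteTabletWatermarkTxnId() {
        try {
            this.deleteTabletWatermarkTxnId = txnIdSource.getNextTransactionId();
        } catch (SketchUserException e) {
            // Mirrors the original behavior: log and leave the field unchanged.
            System.err.println("get next transaction id failed: " + e.getMessage());
        }
    }

    long watermark() {
        return deleteTabletWatermarkTxnId;
    }
}

// Each concrete job reuses the helper instead of repeating the try/catch.
class RollupJobSketch extends AlterJobSketch {
    RollupJobSketch(TxnIdSource source) {
        super(source);
    }

    void onFinished() {
        assignDeleteTabletWatermarkTxnId();
    }
}

class SchemaChangeJobSketch extends AlterJobSketch {
    SchemaChangeJobSketch(TxnIdSource source) {
        super(source);
    }

    void onFinished() {
        assignDeleteTabletWatermarkTxnId();
    }
}
```

In this sketch a job built with a working id source ends up with that id as its watermark after `onFinished()`, while a job whose source throws keeps the `-1` default and only logs a warning.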
done
TPC-H: Total hot run time: 40319 ms
It seems a load job that arrives after the watermark but before the tablets are decommissioned will still run into this problem. However, this PR may prevent -215 in most cases. I have no better ideas currently @DarvenDuan @dataroaring.
TPC-DS: Total hot run time: 189677 ms
ClickBench: Total hot run time: 30.79 s
Doris holds the table's writeLock, then sets tablets to decommission and deletes index infos(
Got it, no problems for me.
LGTM
PR approved by anyone and no changes requested.
TPC-H: Total hot run time: 40063 ms
TPC-DS: Total hot run time: 189973 ms
ClickBench: Total hot run time: 30.01 s
TPC-H: Total hot run time: 37425 ms
TPC-DS: Total hot run time: 183972 ms
ClickBench: Total hot run time: 30.99 s
run buildall
LGTM
Proposed changes
If a Doris schema change job and a load job execute in parallel, the load job may fail after the schema change job finishes.
A schema change job generates a new shadow index that receives the new data stream and holds the converted history data. After the schema change job finishes, FE deletes the origin index and its tablets from FE's meta, and then sends drop tablet tasks to BE to drop the origin tablets' meta and data. But if a load job that is writing to both the origin tablets and the new tablets has not finished, it will fail with OLAP_ERR_TABLE_NOT_FOUND.
With this PR, Doris no longer deletes the tablets of the origin index immediately when the schema change job finishes. Instead, it sets the tablets' state to DECOMMISSION and deletes them later, after all transactions on those tablets have finished.
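The deferred deletion described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual Doris implementation: `TabletSketch`, `DecommissionSweeper`, and all of their methods are hypothetical, transaction ids are assumed to grow monotonically, and "all transactions on those tablets are finished" is approximated as "no running transaction has an id below the watermark taken at decommission time".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

enum TabletState { NORMAL, DECOMMISSION }

// Hypothetical tablet record; the real tablet meta holds far more.
class TabletSketch {
    final long id;
    TabletState state = TabletState.NORMAL;
    long watermarkTxnId = -1;

    TabletSketch(long id) {
        this.id = id;
    }
}

class DecommissionSweeper {
    private final Map<Long, TabletSketch> tablets = new HashMap<>();
    private final Set<Long> runningTxnIds = new TreeSet<>();
    private long nextTxnId = 1;

    void addTablet(long tabletId) {
        tablets.put(tabletId, new TabletSketch(tabletId));
    }

    boolean hasTablet(long tabletId) {
        return tablets.containsKey(tabletId);
    }

    long beginTxn() {
        long id = nextTxnId++;
        runningTxnIds.add(id);
        return id;
    }

    void finishTxn(long txnId) {
        runningTxnIds.remove(txnId);
    }

    // Instead of dropping the tablet, mark it and remember a watermark txn id.
    void decommission(long tabletId) {
        TabletSketch t = tablets.get(tabletId);
        t.state = TabletState.DECOMMISSION;
        t.watermarkTxnId = nextTxnId++;
    }

    // True when every transaction that started before the watermark has ended.
    private boolean previousTxnsFinished(long watermarkTxnId) {
        for (long id : runningTxnIds) {
            if (id < watermarkTxnId) {
                return false;
            }
        }
        return true;
    }

    // Periodic sweep: drop only decommissioned tablets that are safe to delete.
    List<Long> sweep() {
        List<Long> dropped = new ArrayList<>();
        Iterator<TabletSketch> it = tablets.values().iterator();
        while (it.hasNext()) {
            TabletSketch t = it.next();
            if (t.state == TabletState.DECOMMISSION
                    && previousTxnsFinished(t.watermarkTxnId)) {
                dropped.add(t.id);
                it.remove();
            }
        }
        return dropped;
    }
}
```

In this model a load transaction that began before `decommission` keeps the tablet alive across any number of `sweep()` calls, and the tablet is dropped on the first sweep after that transaction finishes.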