-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix part of #3047: Collection contents migration #3209
Conversation
Also, some explanation for design choices:
|
Codecov Report
@@ Coverage Diff @@
## develop #3209 +/- ##
===========================================
- Coverage 45.82% 45.79% -0.04%
===========================================
Files 230 234 +4
Lines 18140 18203 +63
Branches 2933 2934 +1
===========================================
+ Hits 8313 8336 +23
- Misses 9827 9867 +40
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wxyxinyu, this is looking really good! I had a few high-level comments.
I think the idea was to called it collection_contents
rather than collection_content
(to avoid ambiguity with a state's content property).
Moreover, I think it's preferable to keep 'nodes' a top-level property of Collection. The collection_contents
structure is only needed for the CollectionModel
, this way we don't repeat 'collection' even inside the Collection
object or its dicts. @seanlip do you have any input here?
core/domain/collection_domain.py
Outdated
# This takes additional 'from_version' and 'to_version' parameters for logging. | ||
CMD_MIGRATE_SCHEMA_TO_LATEST_VERSION = 'migrate_schema_to_latest_version' | ||
# This takes additional 'version' parameter for logging. | ||
CMD_MIGRATE_SCHEMA = 'migrate_schema' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can't change, since we could have 'migrate_schema' currently stored in the database. Why change it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above -- the collection migration structure is somewhat different from explorations, in that it performs any necessary migrations on reading the collection from the model, and not on save like in explorations. So, it doesn't make sense to have a 'from_version' and 'to_version'. Furthermore, I believe there have currently not been any migrations done (since this migration job is new), so changing it is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not having done any migrations up to now is a good point and we can change the string because of it. However, I don't know if I understand what you mean by not needing a from/to version. Isn't that still useful if we wanted to go back and see which versions were migrated as part of the job?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed it back
core/domain/collection_domain.py
Outdated
self.from_version = change_dict['from_version'] | ||
self.to_version = change_dict['to_version'] | ||
elif self.cmd == CMD_MIGRATE_SCHEMA: | ||
self.version = change_dict['version'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed it back
Collection domain object class. | ||
""" | ||
|
||
def test_correct_collection_content_schema_conversion_methods_exist(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are good tests, but it's probably also worth adding a new test for each specific migration function (both for the collection content schema and the full collection dict migration).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What tests are you referring to here? I believe I already have a test which tests that nodes gets migrated to collection_contents.
models.NAMES.base_model, models.NAMES.collection]) | ||
|
||
|
||
class CollectionMigrationJobManager(jobs.BaseMapReduceJobManager): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should just be CollectionMigrationJob
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
collection.validate() | ||
except Exception as e: | ||
logging.error( | ||
'Collection %s failed non-strict validation: %s' % |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe collections have strict validation. :) Please update the error string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just failed validation
is specific enough, no need to clarify with strict versus non-strict.
nodes to collection_content if collection_content is empty. | ||
""" | ||
# Create an exploration to put in the collection. | ||
self.save_new_valid_exploration( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a valid exploration for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess not.
core/domain/collection_services.py
Outdated
# If collection_content is empty, attempt to retrieve nodes data from nodes | ||
# instead. This is temporary, and intended to not break backwards | ||
# compatibility before the migration job is run. | ||
# TODO(wxy): Remove this after collection migration is completed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't really remove this, can we? Oppia can be used by other organizations that might have older collections before receiving this version, so they could effectively skip the releases that have this code if they are slow to update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can, actually, if it's dead code. They could upgrade piecemeal (to release X, first, do the migration, and then upgrade again to release Y). Similar to answer migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure -- reading from nodes seems to be a hacky thing that isn't really supposed to happen, so I think it's better not to have the code in the codebase. But Ben makes a good point about backwards compatibility :/
rights_manager.publish_collection(self.owner_id, self.COLLECTION_ID) | ||
|
||
# This should not give an error. | ||
collection_services.update_collection( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also validate that the commit actually happened?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -63,6 +63,11 @@ class CollectionModel(base_models.VersionedModel): | |||
# A dict representing all explorations belonging to this collection. | |||
nodes = ndb.JsonProperty(default={}, indexed=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment indicating this is now deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wxyxinyu see e.g. skill_tags in exploration/gae_models.py for an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -33,9 +33,10 @@ oppia.factory('CollectionObjectFactory', [ | |||
// This map acts as a fast way of looking up a collection node for a given | |||
// exploration ID. | |||
this._explorationIdToNodeIndexMap = {}; | |||
for (var i = 0; i < collectionBackendObject.nodes.length; i++) { | |||
var collectionContent = collectionBackendObject.collection_content; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, prefer not to represent collection_contents
in the actual structure of the collection (frontend or backend), just as part of the model. I think this will simplify your PR a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@BenHenning, sorry -- are you only talking about the frontend here, or do you mean the domain object in the backend as well? In both cases, I don't think I mind either way -- one way keeps parity with the storage model, the other way might be easier to understand. I lean mildly towards the latter for understandability, but would defer to @wxyxinyu's thoughts, too. It's the storage layer we really need to get right. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seanlip I was referring to changing structure in both the backend and frontend domain objects. The need for the contents dict is generally confusing and specific to the model and migration pipeline, whereas from a collection domain point of view the structure requirements are different and can be simpler. I think what Xinyu has right now is really clean and looks good.
@wxyxinyu thanks! This really does look great. I had a few more comments. Let's not worry about the backward compatibility bits for now; Sean and I talked about this a bit at the meeting last night and we're not yet sold on a plan. Let's discuss it more during next Monday's meeting.
Regarding the tests, I was referring to having specific tests to verify the migration from v1 to v2 is correct, v2 to v3, etc. We do this for explorations, but it's more for general yaml conversion. I think taking precedence from that is fine, but we ought to have specific tests for the conversion behavior itself (not just verifying the migration job works or that the migration methods are present).
core/domain/collection_domain.py
Outdated
# This takes additional 'from_version' and 'to_version' parameters for logging. | ||
CMD_MIGRATE_SCHEMA_TO_LATEST_VERSION = 'migrate_schema_to_latest_version' | ||
# This takes additional 'version' parameter for logging. | ||
CMD_MIGRATE_SCHEMA = 'migrate_schema' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not having done any migrations up to now is a good point and we can change the string because of it. However, I don't know if I understand what you mean by not needing a from/to version. Isn't that still useful if we wanted to go back and see which versions were migrated as part of the job?
core/domain/collection_domain.py
Outdated
Note that the versioned_collection_contents being passed in is modified | ||
in-place. | ||
""" | ||
versioned_collection_contents['schema_version'] = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make sure that current_version + 1
doesn't exceed the current schema version defined in feconf
? Maybe worth raising an exception if that happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@staticmethod | ||
def map(item): | ||
if item.deleted: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just so that whoever is running this job has a good idea of overall usage. We actually should also yield the successful case, so that in all we would see the following output from running the job:
- number of collections not migrated because they are deleted
- number of collections successfully migrated
- each error message encountered when validating collections prior to migration
The exploration migration job ought to also be updated, but that can be done passively (not even worth tracking a bug for it). This is a nice-to-have that we might as well do with the introduction of this job.
try: | ||
collection.validate() | ||
except Exception as e: | ||
logging.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted collections aren't an error, but showing the admin the general behavior of the job is really useful. Please see my earlier comment about what this job ought to output for effectiveness.
updated_collection.schema_version, | ||
feconf.CURRENT_COLLECTION_SCHEMA_VERSION) | ||
after_converted_yaml = updated_collection.to_yaml() | ||
self.assertEqual(after_converted_yaml, yaml_before_migration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes you are right. I was thinking that collection dicts included the version of the collection itself (not the schema version), but they apparently don't so checking dict is basically the same as yaml (except yaml doesn't include the collection ID, but that's moot).
core/domain/collection_services.py
Outdated
# instead. This is temporary, and intended to not break backwards | ||
# compatibility before the migration job is run. | ||
# TODO(wxy): Remove this after collection migration is completed. | ||
if versioned_collection_contents['collection_contents'] == {}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More pythonic:
if not versioned_collection_contents['collection_contents']:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -134,23 +135,35 @@ def get_collection_from_model(collection_model, run_conversion=True): | |||
""" | |||
|
|||
# Ensure the original collection model does not get altered. | |||
versioned_collection = { | |||
versioned_collection_contents = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could just do:
versioned_collection_contents = copy.deepcopy(collection_model.collection_contents)
versioned_collection_contents['schema_version'] = collection_model.schema_version
It's a bit awkward to have versioned_collection_contents['collection_contents']
, though I'm a bit ambivalent and am okay with it staying as is (since it has parity with explorations). It just seems highly unlikely we would ever have a key conflict in the contents dict with schema_version
and have an opportunity to clean this up, whereas with exploration states it's more complicated since key names are user provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is easier because we eventually need to copy versioned_collection_contents['collection_contents']
back into the model.
collection.validate() | ||
except Exception as e: | ||
logging.error( | ||
'Collection %s failed non-strict validation: %s' % |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just failed validation
is specific enough, no need to clarify with strict versus non-strict.
@@ -0,0 +1,152 @@ | |||
# coding: utf-8 | |||
# | |||
# Copyright 201/c7 The Oppia Authors. All Rights Reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
# job. | ||
collection = collection_domain.Collection.create_default_collection( | ||
self.COLLECTION_ID, 'A title', 'A Category', 'An Objective') | ||
collection_services.save_new_collection(self.albert_id, collection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this be the newest schema version? See test_utils
for adding a method which creates an early schema version of a collection, similar to what we have inside of test_utils
for explorations. Happy to chat over Hangouts if you have any questions about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is actually supposed to be the latest version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes you're definitely right.
Just confirming that @BenHenning's comments on frontend/backend structure SGTM. (Thanks for clarifying, Ben!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Xinyu. I had two small comments, but I'm still hoping to see:
Regarding the tests, I was referring to having specific tests to verify the migration from v1 to v2 is correct, v2 to v3, etc. We do this for explorations, but it's more for general yaml conversion. I think taking precedence from that is fine, but we ought to have specific tests for the conversion behavior itself (not just verifying the migration job works or that the migration methods are present).
Once we have more granular tests for the individual parts of the migration process, then this PR will LGTM.
core/domain/collection_domain.py
Outdated
@@ -397,6 +398,13 @@ def update_collection_contents_from_model( | |||
Note that the versioned_collection_contents being passed in is modified | |||
in-place. | |||
""" | |||
if (versioned_collection_contents['schema_version'] > |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't current_version + 1
be checked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, done
(item.id, e)) | ||
'Collection %s failed validation: %s' % (item.id, e)) | ||
yield (CollectionMigrationJob._ERROR_KEY, | ||
'Collection %s failed validation: %s' % (item.id, e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to log the exception since it's being logged with logging.error()
and will be available via production logs. We can't always guarantee whether a particular Python value can be properly encoded into Unicode, and in the event this exception can't it will cause the mapreduce shard to fail (and a lot of errors to be outputted to the logs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
# job. | ||
collection = collection_domain.Collection.create_default_collection( | ||
self.COLLECTION_ID, 'A title', 'A Category', 'An Objective') | ||
collection_services.save_new_collection(self.albert_id, collection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes you're definitely right.
I'm not quite sure what tests you're referring to. In exp_domain_test, yaml forms of explorations are loaded from each old version. This is already present in collection_domain_test. Are there any other tests? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wxyxinyu!
This is task 1 of #3047: added a new field called collection_contents, and copies the old contents of nodes into it.