Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix part of #3047: Collection contents migration #3209

Merged
merged 16 commits into from
Mar 28, 2017

Conversation

wxyxinyu
Copy link
Contributor

This is task 1 of #3047: added a new field called collection_contents, and copies the old contents of nodes into it.

@wxyxinyu
Copy link
Contributor Author

wxyxinyu commented Mar 16, 2017

Also, some explanation for design choices:

  1. The collection schema versions are structured such that migration is performed automatically upon loading (unlike in explorations), so the migration mapreduce job can't get the old version of the collection without significant changes here. I think this is as intended? This is why I changed the schema update command.
  2. The migration job states the version being migrated to. The current changes don't increment the version, but I intend for the next release to have these changes as well as the addition of the skills list, which will increment the schema version.

@seanlip seanlip removed their request for review March 16, 2017 09:49
@codecov-io
Copy link

codecov-io commented Mar 16, 2017

Codecov Report

Merging #3209 into develop will decrease coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3209      +/-   ##
===========================================
- Coverage    45.82%   45.79%   -0.04%     
===========================================
  Files          230      234       +4     
  Lines        18140    18203      +63     
  Branches      2933     2934       +1     
===========================================
+ Hits          8313     8336      +23     
- Misses        9827     9867      +40
Impacted Files Coverage Δ
core/templates/dev/head/pages/Base.js 7.69% <0%> (-69.24%) ⬇️
...es/dev/head/domain/sidebar/SidebarStatusService.js 3.7% <0%> (-22.23%) ⬇️
...ead/domain/exploration/InteractionObjectFactory.js 87.5% <0%> (-12.5%) ⬇️
.../exploration_editor/editor_tab/StateInteraction.js 38.65% <0%> (-1.61%) ⬇️
core/templates/dev/head/app.js 53.03% <0%> (-0.56%) ⬇️
...es/exploration_editor/editor_tab/StateResponses.js 18.01% <0%> (-0.13%) ⬇️
...ev/head/pages/exploration_player/PlayerServices.js 1.7% <0%> (ø) ⬆️
.../exploration_editor/history_tab/HistoryServices.js 93.9% <0%> (ø) ⬆️
...ges/exploration_editor/feedback_tab/FeedbackTab.js 0.72% <0%> (ø) ⬆️
...ev/head/pages/exploration_editor/EditorServices.js 36.01% <0%> (ø) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d7be26...ab02160. Read the comment docs.

Copy link
Member

@BenHenning BenHenning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wxyxinyu, this is looking really good! I had a few high-level comments.

I think the idea was to called it collection_contents rather than collection_content (to avoid ambiguity with a state's content property).

Moreover, I think it's preferable to keep 'nodes' a top-level property of Collection. The collection_contents structure is only needed for the CollectionModel, this way we don't repeat 'collection' even inside the Collection object or its dicts. @seanlip do you have any input here?

# This takes additional 'from_version' and 'to_version' parameters for logging.
CMD_MIGRATE_SCHEMA_TO_LATEST_VERSION = 'migrate_schema_to_latest_version'
# This takes additional 'version' parameter for logging.
CMD_MIGRATE_SCHEMA = 'migrate_schema'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't change, since we could have 'migrate_schema' currently stored in the database. Why change it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above -- the collection migration structure is somewhat different from explorations, in that it performs any necessary migrations on reading the collection from the model, and not on save like in explorations. So, it doesn't make sense to have a 'from_version' and 'to_version'. Furthermore, I believe there have currently not been any migrations done (since this migration job is new), so changing it is ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not having done any migrations up to now is a good point and we can change the string because of it. However, I don't know if I understand what you mean by not needing a from/to version. Isn't that still useful if we wanted to go back and see which versions were migrated as part of the job?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it back

self.from_version = change_dict['from_version']
self.to_version = change_dict['to_version']
elif self.cmd == CMD_MIGRATE_SCHEMA:
self.version = change_dict['version']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it back

Collection domain object class.
"""

def test_correct_collection_content_schema_conversion_methods_exist(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good tests, but it's probably also worth adding a new test for each specific migration function (both for the collection content schema and the full collection dict migration).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What tests are you referring to here? I believe I already have a test which tests that nodes gets migrated to collection_contents.

models.NAMES.base_model, models.NAMES.collection])


class CollectionMigrationJobManager(jobs.BaseMapReduceJobManager):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just be CollectionMigrationJob.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

collection.validate()
except Exception as e:
logging.error(
'Collection %s failed non-strict validation: %s' %
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe collections have strict validation. :) Please update the error string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just failed validation is specific enough, no need to clarify with strict versus non-strict.

nodes to collection_content if collection_content is empty.
"""
# Create an exploration to put in the collection.
self.save_new_valid_exploration(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a valid exploration for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess not.

# If collection_content is empty, attempt to retrieve nodes data from nodes
# instead. This is temporary, and intended to not break backwards
# compatibility before the migration job is run.
# TODO(wxy): Remove this after collection migration is completed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't really remove this, can we? Oppia can be used by other organizations that might have older collections before receiving this version, so they could effectively skip the releases that have this code if they are slow to update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, actually, if it's dead code. They could upgrade piecemeal (to release X, first, do the migration, and then upgrade again to release Y). Similar to answer migration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure -- reading from nodes seems to be a hacky thing that isn't really supposed to happen, so I think it's better not to have the code in the codebase. But Ben makes a good point about backwards compatibility :/

rights_manager.publish_collection(self.owner_id, self.COLLECTION_ID)

# This should not give an error.
collection_services.update_collection(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also validate that the commit actually happened?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -63,6 +63,11 @@ class CollectionModel(base_models.VersionedModel):
# A dict representing all explorations belonging to this collection.
nodes = ndb.JsonProperty(default={}, indexed=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment indicating this is now deprecated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wxyxinyu see e.g. skill_tags in exploration/gae_models.py for an example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -33,9 +33,10 @@ oppia.factory('CollectionObjectFactory', [
// This map acts as a fast way of looking up a collection node for a given
// exploration ID.
this._explorationIdToNodeIndexMap = {};
for (var i = 0; i < collectionBackendObject.nodes.length; i++) {
var collectionContent = collectionBackendObject.collection_content;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, prefer not to represent collection_contents in the actual structure of the collection (frontend or backend), just as part of the model. I think this will simplify your PR a lot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@seanlip
Copy link
Member

seanlip commented Mar 17, 2017

Moreover, I think it's preferable to keep 'nodes' a top-level property of Collection. The collection_contents structure is only needed for the CollectionModel, this way we don't repeat 'collection' even inside the Collection object or its dicts. @seanlip do you have any input here?

@BenHenning, sorry -- are you only talking about the frontend here, or do you mean the domain object in the backend as well?

In both cases, I don't think I mind either way -- one way keeps parity with the storage model, the other way might be easier to understand. I lean mildly towards the latter for understandability, but would defer to @wxyxinyu's thoughts, too. It's the storage layer we really need to get right.

Copy link
Member

@BenHenning BenHenning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seanlip I was referring to changing structure in both the backend and frontend domain objects. The need for the contents dict is generally confusing and specific to the model and migration pipeline, whereas from a collection domain point of view the structure requirements are different and can be simpler. I think what Xinyu has right now is really clean and looks good.

@wxyxinyu thanks! This really does look great. I had a few more comments. Let's not worry about the backward compatibility bits for now; Sean and I talked about this a bit at the meeting last night and we're not yet sold on a plan. Let's discuss it more during next Monday's meeting.

Regarding the tests, I was referring to having specific tests to verify the migration from v1 to v2 is correct, v2 to v3, etc. We do this for explorations, but it's more for general yaml conversion. I think taking precedence from that is fine, but we ought to have specific tests for the conversion behavior itself (not just verifying the migration job works or that the migration methods are present).

# This takes additional 'from_version' and 'to_version' parameters for logging.
CMD_MIGRATE_SCHEMA_TO_LATEST_VERSION = 'migrate_schema_to_latest_version'
# This takes additional 'version' parameter for logging.
CMD_MIGRATE_SCHEMA = 'migrate_schema'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not having done any migrations up to now is a good point and we can change the string because of it. However, I don't know if I understand what you mean by not needing a from/to version. Isn't that still useful if we wanted to go back and see which versions were migrated as part of the job?

Note that the versioned_collection_contents being passed in is modified
in-place.
"""
versioned_collection_contents['schema_version'] = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make sure that current_version + 1 doesn't exceed the current schema version defined in feconf? Maybe worth raising an exception if that happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@staticmethod
def map(item):
if item.deleted:
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just so that whoever is running this job has a good idea of overall usage. We actually should also yield the successful case, so that in all we would see the following output from running the job:

  • number of collections not migrated because they are deleted
  • number of collections successfully migrated
  • each error message encountered when validating collections prior to migration

The exploration migration job ought to also be updated, but that can be done passively (not even worth tracking a bug for it). This is a nice-to-have that we might as well do with the introduction of this job.

try:
collection.validate()
except Exception as e:
logging.error(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted collections aren't an error, but showing the admin the general behavior of the job is really useful. Please see my earlier comment about what this job ought to output for effectiveness.

updated_collection.schema_version,
feconf.CURRENT_COLLECTION_SCHEMA_VERSION)
after_converted_yaml = updated_collection.to_yaml()
self.assertEqual(after_converted_yaml, yaml_before_migration)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes you are right. I was thinking that collection dicts included the version of the collection itself (not the schema version), but they apparently don't so checking dict is basically the same as yaml (except yaml doesn't include the collection ID, but that's moot).

# instead. This is temporary, and intended to not break backwards
# compatibility before the migration job is run.
# TODO(wxy): Remove this after collection migration is completed.
if versioned_collection_contents['collection_contents'] == {}:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More pythonic:

if not versioned_collection_contents['collection_contents']:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -134,23 +135,35 @@ def get_collection_from_model(collection_model, run_conversion=True):
"""

# Ensure the original collection model does not get altered.
versioned_collection = {
versioned_collection_contents = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could just do:

versioned_collection_contents = copy.deepcopy(collection_model.collection_contents)
versioned_collection_contents['schema_version'] = collection_model.schema_version

It's a bit awkward to have versioned_collection_contents['collection_contents'], though I'm a bit ambivalent and am okay with it staying as is (since it has parity with explorations). It just seems highly unlikely we would ever have a key conflict in the contents dict with schema_version and have an opportunity to clean this up, whereas with exploration states it's more complicated since key names are user provided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is easier because we eventually need to copy versioned_collection_contents['collection_contents'] back into the model.

collection.validate()
except Exception as e:
logging.error(
'Collection %s failed non-strict validation: %s' %
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just failed validation is specific enough, no need to clarify with strict versus non-strict.

@@ -0,0 +1,152 @@
# coding: utf-8
#
# Copyright 201/c7 The Oppia Authors. All Rights Reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# job.
collection = collection_domain.Collection.create_default_collection(
self.COLLECTION_ID, 'A title', 'A Category', 'An Objective')
collection_services.save_new_collection(self.albert_id, collection)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this be the newest schema version? See test_utils for adding a method which creates an early schema version of a collection, similar to what we have inside of test_utils for explorations. Happy to chat over Hangouts if you have any questions about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is actually supposed to be the latest version

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes you're definitely right.

@seanlip
Copy link
Member

seanlip commented Mar 22, 2017

Just confirming that @BenHenning's comments on frontend/backend structure SGTM. (Thanks for clarifying, Ben!)

@BenHenning BenHenning assigned wxyxinyu and unassigned BenHenning Mar 23, 2017
Copy link
Member

@BenHenning BenHenning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Xinyu. I had two small comments, but I'm still hoping to see:

Regarding the tests, I was referring to having specific tests to verify the migration from v1 to v2 is correct, v2 to v3, etc. We do this for explorations, but it's more for general yaml conversion. I think taking precedence from that is fine, but we ought to have specific tests for the conversion behavior itself (not just verifying the migration job works or that the migration methods are present).

Once we have more granular tests for the individual parts of the migration process, then this PR will LGTM.

@@ -397,6 +398,13 @@ def update_collection_contents_from_model(
Note that the versioned_collection_contents being passed in is modified
in-place.
"""
if (versioned_collection_contents['schema_version'] >
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't current_version + 1 be checked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, done

(item.id, e))
'Collection %s failed validation: %s' % (item.id, e))
yield (CollectionMigrationJob._ERROR_KEY,
'Collection %s failed validation: %s' % (item.id, e))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to log the exception since it's being logged with logging.error() and will be available via production logs. We can't always guarantee whether a particular Python value can be properly encoded into Unicode, and in the event this exception can't it will cause the mapreduce shard to fail (and a lot of errors to be outputted to the logs).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# job.
collection = collection_domain.Collection.create_default_collection(
self.COLLECTION_ID, 'A title', 'A Category', 'An Objective')
collection_services.save_new_collection(self.albert_id, collection)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes you're definitely right.

@wxyxinyu
Copy link
Contributor Author

I'm not quite sure what tests you're referring to. In exp_domain_test, yaml forms of explorations are loaded from each old version. This is already present in collection_domain_test. Are there any other tests?

@wxyxinyu wxyxinyu removed their assignment Mar 28, 2017
Copy link
Member

@BenHenning BenHenning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wxyxinyu!

@seanlip seanlip merged commit 8530045 into develop Mar 28, 2017
@seanlip seanlip deleted the collection-contents-migration branch March 28, 2017 08:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants