-
Notifications
You must be signed in to change notification settings - Fork 602
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add rdkit_2d featurizer #877
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better not to add Descriptastorus as a required dependency if it is not too difficult to write our own versions and we accept additional lines of code for the implementation of RDKit2DNormalized.
chemprop/featurizers/molecule.py
Outdated
NAN_TOKEN = 0 | ||
features[np.isnan(self.x_d)] = NAN_TOKEN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is self.x_d
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.x_d is molecular descriptor for a data point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, but if it is an attribute in the data points, can it be accessed here? Besides, it seems this code here already handled the issues about nan values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had copied this code (but accidentally left self.x_d) because some of the rdkit2d features would return nan (rdkit/rdkit#5824) and I feel it is the responsibility of the featurizer to not return nans.
But in any event, these features that would return nan are not calculated by descriptastorus. There are 12 features that rdkit descriptors have that Descriptastorus doesn't. But I figure the consistency with chemprop v1 is worth not having these extra features especially since a user can still calculate them manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding Descriptastorus back. Could you also include it in the environment.yml?
I have changed the implementation to rely on Descriptastorus. This is what was done in v1. I will also note that the featurizer works for mols without any heavy atoms
|
Thanks for adding the new featurizer! Can you add unit test and/or CLI test for it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conda recipe will only work with packages on conda-forge
, so we cannot use the git-based install.
Elaborating on my comment above:
My suggested alternatives:
There's also a theoretical question about whether or not using the |
e57d674
to
a620924
Compare
Ready for another review. I removed descriptastorus as a dependency (see below for why).
I added a clunky test, open to improvements on it. Mostly I wanted to make sure that the molecule featurizers don't change how they featurize (e.g. RDKit adds another molecule feature to its list), because Chemprop models from the CLI rely on the featurizers not changing. I also added a note about using descriptastorus to calculate normalized features to the docs. why remove descriptastorus@JacksonBurns had some thoughts about this:
Turns out descriptastorus is on conda-forge: https://anaconda.org/conda-forge/descriptastorus
I also think that the previous trouble we had with descriptastorus wasn't that bad. The latest version has an error that makes it incompatible with scipy<1.9, but that version of scipy was released two years ago. The simple solution was to update scipy. This error was fixed two months ago on GitHub, but hasn't been released as a new version.
I think this is something to consider. The main reason I chose to not have descriptastorus available from the CLI is that I don't know where their scaling came from. Including it in the CLI is essentially a recommendation to use it as it only requires specifying a flag, which I don't feel I can recommend using if I don't know where it came from. Mostly I want to encourage chemprop users to be aware of what information their model is actually using to make predictions. If they are using descriptors, they should hopefully have some idea of what those descriptors are. A secondary reason I chose to not use descriptastorus is that it is missing a few of the newer molecular features added to RDKit. See the linked issues for details. |
Didn't know it was on I'll note too that the incompatible version of Someone else has also just opened an issue asking about the scaling on the |
I've started a small test PR on the conda recipe to see if the build will work with |
chemprop/featurizers/molecule.py
Outdated
else: | ||
features = np.array([func(mol) for name, func in Descriptors.descList], float) | ||
|
||
NAN_TOKEN = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I see the value in setting this here and then using it immediately versus just adding a comment that says 'set np.nan to zero'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main value is that it is consistent with how we deal with nans in the extra features/descriptors the user gives when making a data point (
chemprop/chemprop/data/datapoints.py
Line 35 in 06171ac
NAN_TOKEN = 0 |
chemprop/featurizers/molecule.py
Outdated
if mol.GetNumHeavyAtoms() == 0: | ||
features = np.array( | ||
[0 if name == "SPS" else func(mol) for name, func in Descriptors.descList], float | ||
) | ||
else: | ||
features = np.array([func(mol) for name, func in Descriptors.descList], float) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if mol.GetNumHeavyAtoms() == 0: | |
features = np.array( | |
[0 if name == "SPS" else func(mol) for name, func in Descriptors.descList], float | |
) | |
else: | |
features = np.array([func(mol) for name, func in Descriptors.descList], float) | |
features = np.array( | |
[func(mol) for _, func in filter(lambda i: mol.GetNumHeavyAtoms() > 0 or i[0] != "SPS", Descriptors.descList)], | |
dtype=float, | |
) |
I think this is the same and avoids duplicating a line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for writing that with filter
. I switched am putting i[0] != "SPS"
first so that we don't need to call GetNumHeavyAtoms
most of the time, only when the function name is SPS.
docs/source/tutorial/cli/train.rst
Outdated
|
||
|
||
Molecule-Level 2D Features | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Morgan fingerprints can be generated as molecular 2D features using :code:`--features-generators`: | ||
Chemprop provides several molecule featurizers that automatically calculates molecular features and uses them as extra datapoint descriptors. These are specified using :code:`--features-generators` followed by one or more of the following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chemprop provides several molecule featurizers that automatically calculates molecular features and uses them as extra datapoint descriptors. These are specified using :code:`--features-generators` followed by one or more of the following: | |
Chemprop provides several molecule featurizers that automatically calculate molecular features and uses them as extra datapoint descriptors. These are specified using :code:`--features-generators` followed by one or more of the following: |
docs/source/tutorial/python/featurizers/molecule_featurizers.ipynb
Outdated
Show resolved
Hide resolved
Hello, anyone know the answer about bp-kelley/descriptastorus#31 ? |
No, we are also curious. |
chemprop/featurizers/molecule.py
Outdated
|
||
if not NO_DESCRIPTASTORUS: | ||
|
||
class V1RDKit2DFeaturizerMixin(VectorFeaturizer[Mol]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between RDKit2DFeaturizer
and V1RDKit2DFeaturizer
? Should we just keep one of them to avoid confusion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the new one has 10 more descriptors. Maybe we can just keep one of them.
'AvgIpc', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', and 'SPS'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we had to keep just one, I would recommend it being the V1RDKit2DFeaturizer because if someone is trying to use a v1 model that used rdkit features, they would need that featurizer.
I added a new version because of the extra descriptors that have been added since descriptastorus. It could be nice to let people get these more descriptors. My intention is that all new models would use RDKit2DFeaturizer and that the V1 version would only be used for models that were trained in v1.
Let me know if you feel strongly, or we can bring it up in the next meeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am okay with keeping both of them. However, in the long term, I don't think people will care about V1RDKit2DFeaturizer anymore, as it will become a legacy we used in the past. If someone wants to do a benchmark with this smaller set, they should take responsibility for generating those features correctly via a notebook and provide them as additional features.
chemprop/featurizers/molecule.py
Outdated
def __init__(self): | ||
raise ImportError( | ||
"Failed to import descriptastorus. Please install descriptastorus " | ||
"(https://github.com/bp-kelley/descriptastorus) to use RDKit 2D features." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw that this is a mock implementation. You might already have an idea of how to implement it in a real implementation. Here is my two cents:
Move this into V1RDKit2DFeaturizerMixin
and check if NO_DESCRIPTASTORUS
is True.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea
It seems we are going to add Descriptastorus as a dependency. Could you modify the pyproject.toml and environment.yml accordingly? |
The linked PR on the conda recipe confirmed that the build will work. |
@shihchengli Am I missing any changes you'd like to see in this PR before merging? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for improving this PR along the way. These are my final comments on this PR. After considering and addressing them, I would be happy to merge it.
docs/source/tutorial/python/featurizers/molecule_featurizers.ipynb
Outdated
Show resolved
Hide resolved
docs/source/tutorial/python/featurizers/molecule_featurizers.ipynb
Outdated
Show resolved
Hide resolved
SPS is a measure of the complexity of the molecule that depends on the number of heavy atoms Co-authored-by: Shih-Cheng Li <scli@mit.edu>
Co-authored-by: Shih-Cheng Li <scli@mit.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great and is ready to merge, I just want to clear something up for future development purposes.
One overall question - we allow for the import of descriptastorus
to fail (i.e. allowing for it to not be installed) but we have it in the PyPI and Conda recipe (i.e. it will always be installed). Why? My initial thinking is that we would want to not make it required or not allow it to be missing.
That's a good point. I think we are okay requiring descriptastorus because it available on PyPI and Conda. Let's see if the tests work when it is required. Current users will need to install descriptastorus themselves when they pull in these changes, but that shouldn't be too hard? |
Agreed, should be straightforward. Approving! |
Description
People want the rdkit_2d_normalized featurizer available from the commandline, probably because it was helpful in v1 (the docs say "As a starting point, we recommend using pre-normalized RDKit features by using the --features_generator rdkit_2d_normalized --no_features_scaling flags"
This PR doesn't do that but is a step in that direction.
This PR implements a featurizer that makes the rdkit 2d features. But they aren't normalized, as that was done in Descriptastorus with precomputed scaling (here). Also Descriptastorus leaves out these descriptors :
['AvgIpc', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'SPS']
. At least "SPS" was added more recently. I haven't checked the others.Also note that "SPS" divides by the number of heavy atoms, so the featurizer has to check for that.
Questions
Would we prefer to just make Descriptastorus a required dependency again?
Checklist
Tests not yet added