Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add rdkit_2d featurizer #877

Merged
merged 19 commits into from
Aug 27, 2024
Merged

Conversation

KnathanM
Copy link
Member

Description

People want the rdkit_2d_normalized featurizer available from the commandline, probably because it was helpful in v1 (the docs say "As a starting point, we recommend using pre-normalized RDKit features by using the --features_generator rdkit_2d_normalized --no_features_scaling flags"

This PR doesn't do that but is a step in that direction.

This PR implements a featurizer that makes the rdkit 2d features. But they aren't normalized, as that was done in Descriptastorus with precomputed scaling (here). Also Descriptastorus leaves out these descriptors : ['AvgIpc', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'SPS']. At least "SPS" was added more recently. I haven't checked the others.

Also note that "SPS" divides by the number of heavy atoms, so the featurizer has to check for that.

Questions

Would we prefer to just make Descriptastorus a required dependency again?

Checklist

  • (if appropriate) unit tests added?
    Tests not yet added

Copy link
Contributor

@shihchengli shihchengli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better not to add Descriptastorus as a required dependency if it is not too difficult to write our own versions and we accept additional lines of code for the implementation of RDKit2DNormalized.

chemprop/featurizers/molecule.py Outdated Show resolved Hide resolved
Comment on lines 62 to 63
NAN_TOKEN = 0
features[np.isnan(self.x_d)] = NAN_TOKEN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is self.x_d?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.x_d is molecular descriptor for a data point

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, but if it is an attribute in the data points, can it be accessed here? Besides, it seems this code here already handled the issues about nan values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had copied this code (but accidentally left self.x_d) because some of the rdkit2d features would return nan (rdkit/rdkit#5824) and I feel it is the responsibility of the featurizer to not return nans.

But in any event, these features that would return nan are not calculated by descriptastorus. There are 12 features that rdkit descriptors have that Descriptastorus doesn't. But I figure the consistency with chemprop v1 is worth not having these extra features especially since a user can still calculate them manually.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding Descriptastorus back. Could you also include it in the environment.yml?

@kevingreenman kevingreenman added this to the v2.0.1 milestone May 23, 2024
@KnathanM
Copy link
Member Author

I have changed the implementation to rely on Descriptastorus. This is what was done in v1.

I will also note that the featurizer works for mols without any heavy atoms

from chemprop.featurizers.molecule import RDKit2DFeaturizer
from rdkit import Chem

mol = Chem.MolFromSmiles('[H][H].[H]')
featurizer = RDKit2DFeaturizer()
featurizer(mol)

@hwpang
Copy link
Contributor

hwpang commented May 24, 2024

Thanks for adding the new featurizer! Can you add unit test and/or CLI test for it?

@KnathanM KnathanM modified the milestones: v2.0.1, v2.0.x Jun 4, 2024
Copy link
Member

@JacksonBurns JacksonBurns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conda recipe will only work with packages on conda-forge, so we cannot use the git-based install.

@JacksonBurns
Copy link
Member

Elaborating on my comment above:

  • this PR has only added the dependency to the pyproject.toml, meaning that the environment.yml-based tests fail because it is missing
  • it can be added to environment.yml but not the conda recipe we use to distribute on conda-forge (meaning we would not have conda install chemprop anymore)
  • I think we should avoid adding descriptastorus as a required dependency, since the repository is rarely updated and has broken on us in the past

My suggested alternatives:

  • vendor just the one function we want
  • add our own feature calculator and standardizer

There's also a theoretical question about whether or not using the descriptastorus features is a data leak. My thinking is that the distributions used to rescale the features are parameterized based on some set that I don't know (does anyone?) that could possibly contain overlapping molecules with a testing or validation set, which would shift the distribution of said features to look more like training and thus constitute a data leak.

@KnathanM KnathanM force-pushed the rdkit_2d_features branch from e57d674 to a620924 Compare June 13, 2024 20:25
@KnathanM
Copy link
Member Author

Ready for another review.

I removed descriptastorus as a dependency (see below for why).

Can you add unit test and/or CLI test for it?

I added a clunky test, open to improvements on it. Mostly I wanted to make sure that the molecule featurizers don't change how they featurize (e.g. RDKit adds another molecule feature to its list), because Chemprop models from the CLI rely on the featurizers not changing.

I also added a note about using descriptastorus to calculate normalized features to the docs.

why remove descriptastorus

@JacksonBurns had some thoughts about this:

it can be added to environment.yml but not the conda recipe we use to distribute on conda-forge (meaning we would not have conda install chemprop anymore)

Turns out descriptastorus is on conda-forge: https://anaconda.org/conda-forge/descriptastorus

I think we should avoid adding descriptastorus as a required dependency, since the repository is rarely updated and has broken on us in the past

I also think that the previous trouble we had with descriptastorus wasn't that bad. The latest version has an error that makes it incompatible with scipy<1.9, but that version of scipy was released two years ago. The simple solution was to update scipy. This error was fixed two months ago on GitHub, but hasn't been released as a new version.

There's also a theoretical question about whether or not using the descriptastorus features is a data leak. My thinking is that the distributions used to rescale the features are parameterized based on some set that I don't know (does anyone?)

I think this is something to consider. The main reason I chose to not have descriptastorus available from the CLI is that I don't know where their scaling came from. Including it in the CLI is essentially a recommendation to use it as it only requires specifying a flag, which I don't feel I can recommend using if I don't know where it came from. Mostly I want to encourage chemprop users to be aware of what information their model is actually using to make predictions. If they are using descriptors, they should hopefully have some idea of what those descriptors are.

A secondary reason I chose to not use descriptastorus is that it is missing a few of the newer molecular features added to RDKit. See the linked issues for details.

@JacksonBurns
Copy link
Member

Didn't know it was on conda-forge! That changes my opinion, and I would be OK with it from a software perspective.

I'll note too that the incompatible version of scipy cannot even be installed on the versions of Python that Chemprop v2 supports, see: https://docs.scipy.org/doc/scipy/dev/toolchain.html#numpy

Someone else has also just opened an issue asking about the scaling on the descriptastorus repo: bp-kelley/descriptastorus#31 I'm hoping for a satisfactory answer.

@JacksonBurns
Copy link
Member

I've started a small test PR on the conda recipe to see if the build will work with descriptastorus.

else:
features = np.array([func(mol) for name, func in Descriptors.descList], float)

NAN_TOKEN = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I see the value in setting this here and then using it immediately versus just adding a comment that says 'set np.nan to zero'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main value is that it is consistent with how we deal with nans in the extra features/descriptors the user gives when making a data point (

). That said, we already take care of this in datapoint, so probably I can remove this nan_to_zero conversion and return an array that potentially has some nans. This is different than what I previously thought, that the featurizers were responsible for not returning nans. Currently, I'm of the opinion we shouldn't hide the nans from the user, except when we need to like when making the datapoints because the model can't take nans.

Comment on lines 61 to 66
if mol.GetNumHeavyAtoms() == 0:
features = np.array(
[0 if name == "SPS" else func(mol) for name, func in Descriptors.descList], float
)
else:
features = np.array([func(mol) for name, func in Descriptors.descList], float)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if mol.GetNumHeavyAtoms() == 0:
features = np.array(
[0 if name == "SPS" else func(mol) for name, func in Descriptors.descList], float
)
else:
features = np.array([func(mol) for name, func in Descriptors.descList], float)
features = np.array(
[func(mol) for _, func in filter(lambda i: mol.GetNumHeavyAtoms() > 0 or i[0] != "SPS", Descriptors.descList)],
dtype=float,
)

I think this is the same and avoids duplicating a line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing that with filter. I switched am putting i[0] != "SPS" first so that we don't need to call GetNumHeavyAtoms most of the time, only when the function name is SPS.



Molecule-Level 2D Features
^^^^^^^^^^^^^^^^^^^^^^^^^^

Morgan fingerprints can be generated as molecular 2D features using :code:`--features-generators`:
Chemprop provides several molecule featurizers that automatically calculates molecular features and uses them as extra datapoint descriptors. These are specified using :code:`--features-generators` followed by one or more of the following:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Chemprop provides several molecule featurizers that automatically calculates molecular features and uses them as extra datapoint descriptors. These are specified using :code:`--features-generators` followed by one or more of the following:
Chemprop provides several molecule featurizers that automatically calculate molecular features and uses them as extra datapoint descriptors. These are specified using :code:`--features-generators` followed by one or more of the following:

tests/cli/test_cli_regression_mol.py Show resolved Hide resolved
@StefanIsSmart
Copy link

Hello, anyone know the answer about bp-kelley/descriptastorus#31 ?

@JacksonBurns
Copy link
Member

Hello, anyone know the answer about bp-kelley/descriptastorus#31 ?

No, we are also curious.


if not NO_DESCRIPTASTORUS:

class V1RDKit2DFeaturizerMixin(VectorFeaturizer[Mol]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between RDKit2DFeaturizer and V1RDKit2DFeaturizer? Should we just keep one of them to avoid confusion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the new one has 10 more descriptors. Maybe we can just keep one of them.

'AvgIpc', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', and 'SPS'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had to keep just one, I would recommend it being the V1RDKit2DFeaturizer because if someone is trying to use a v1 model that used rdkit features, they would need that featurizer.

I added a new version because of the extra descriptors that have been added since descriptastorus. It could be nice to let people get these more descriptors. My intention is that all new models would use RDKit2DFeaturizer and that the V1 version would only be used for models that were trained in v1.

Let me know if you feel strongly, or we can bring it up in the next meeting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am okay with keeping both of them. However, in the long term, I don't think people will care about V1RDKit2DFeaturizer anymore, as it will become a legacy we used in the past. If someone wants to do a benchmark with this smaller set, they should take responsibility for generating those features correctly via a notebook and provide them as additional features.

Comment on lines 108 to 112
def __init__(self):
raise ImportError(
"Failed to import descriptastorus. Please install descriptastorus "
"(https://github.com/bp-kelley/descriptastorus) to use RDKit 2D features."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw that this is a mock implementation. You might already have an idea of how to implement it in a real implementation. Here is my two cents:
Move this into V1RDKit2DFeaturizerMixin and check if NO_DESCRIPTASTORUS is True.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

@shihchengli
Copy link
Contributor

shihchengli commented Aug 1, 2024

It seems we are going to add Descriptastorus as a dependency. Could you modify the pyproject.toml and environment.yml accordingly?

@JacksonBurns
Copy link
Member

I've started a small test PR on the conda recipe to see if the build will work with descriptastorus.

The linked PR on the conda recipe confirmed that the build will work.

@KnathanM
Copy link
Member Author

@shihchengli Am I missing any changes you'd like to see in this PR before merging?

Copy link
Contributor

@shihchengli shihchengli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for improving this PR along the way. These are my final comments on this PR. After considering and addressing them, I would be happy to merge it.

docs/source/tutorial/cli/train.rst Outdated Show resolved Hide resolved
docs/source/tutorial/cli/train.rst Outdated Show resolved Hide resolved
KnathanM and others added 3 commits August 24, 2024 15:59
SPS is a measure of the complexity of the molecule that depends on the number of heavy atoms

Co-authored-by: Shih-Cheng Li <scli@mit.edu>
Co-authored-by: Shih-Cheng Li <scli@mit.edu>
@KnathanM KnathanM requested a review from JacksonBurns August 24, 2024 20:24
Copy link
Member

@JacksonBurns JacksonBurns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great and is ready to merge, I just want to clear something up for future development purposes.

One overall question - we allow for the import of descriptastorus to fail (i.e. allowing for it to not be installed) but we have it in the PyPI and Conda recipe (i.e. it will always be installed). Why? My initial thinking is that we would want to not make it required or not allow it to be missing.

@KnathanM
Copy link
Member Author

That's a good point. I think we are okay requiring descriptastorus because it available on PyPI and Conda. Let's see if the tests work when it is required. Current users will need to install descriptastorus themselves when they pull in these changes, but that shouldn't be too hard?

@JacksonBurns
Copy link
Member

Agreed, should be straightforward. Approving!

@JacksonBurns JacksonBurns self-requested a review August 27, 2024 14:44
@JacksonBurns JacksonBurns enabled auto-merge (squash) August 27, 2024 14:45
@JacksonBurns JacksonBurns merged commit d34f83f into chemprop:main Aug 27, 2024
13 checks passed
@KnathanM KnathanM deleted the rdkit_2d_features branch October 7, 2024 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants