add rdkit_2d featurizer #877

KnathanM · 2024-05-22T21:53:55Z

Description

People want the rdkit_2d_normalized featurizer available from the commandline, probably because it was helpful in v1 (the docs say "As a starting point, we recommend using pre-normalized RDKit features by using the --features_generator rdkit_2d_normalized --no_features_scaling flags"

This PR doesn't do that but is a step in that direction.

This PR implements a featurizer that makes the rdkit 2d features. But they aren't normalized, as that was done in Descriptastorus with precomputed scaling (here). Also Descriptastorus leaves out these descriptors : ['AvgIpc', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'SPS']. At least "SPS" was added more recently. I haven't checked the others.

Also note that "SPS" divides by the number of heavy atoms, so the featurizer has to check for that.

Questions

Would we prefer to just make Descriptastorus a required dependency again?

Checklist

(if appropriate) unit tests added?
Tests not yet added

shihchengli

I think it would be better not to add Descriptastorus as a required dependency if it is not too difficult to write our own versions and we accept additional lines of code for the implementation of RDKit2DNormalized.

chemprop/featurizers/molecule.py

shihchengli · 2024-05-23T05:56:18Z

chemprop/featurizers/molecule.py

+        NAN_TOKEN = 0
+        features[np.isnan(self.x_d)] = NAN_TOKEN


What is self.x_d?

self.x_d is molecular descriptor for a data point

I see, but if it is an attribute in the data points, can it be accessed here? Besides, it seems this code here already handled the issues about nan values.

I had copied this code (but accidentally left self.x_d) because some of the rdkit2d features would return nan (rdkit/rdkit#5824) and I feel it is the responsibility of the featurizer to not return nans.

But in any event, these features that would return nan are not calculated by descriptastorus. There are 12 features that rdkit descriptors have that Descriptastorus doesn't. But I figure the consistency with chemprop v1 is worth not having these extra features especially since a user can still calculate them manually.

Thanks for adding Descriptastorus back. Could you also include it in the environment.yml?

KnathanM · 2024-05-24T13:54:18Z

I have changed the implementation to rely on Descriptastorus. This is what was done in v1.

I will also note that the featurizer works for mols without any heavy atoms

from chemprop.featurizers.molecule import RDKit2DFeaturizer
from rdkit import Chem

mol = Chem.MolFromSmiles('[H][H].[H]')
featurizer = RDKit2DFeaturizer()
featurizer(mol)

hwpang · 2024-05-24T14:27:15Z

Thanks for adding the new featurizer! Can you add unit test and/or CLI test for it?

JacksonBurns

The conda recipe will only work with packages on conda-forge, so we cannot use the git-based install.

JacksonBurns · 2024-06-04T21:57:13Z

Elaborating on my comment above:

this PR has only added the dependency to the pyproject.toml, meaning that the environment.yml-based tests fail because it is missing
it can be added to environment.yml but not the conda recipe we use to distribute on conda-forge (meaning we would not have conda install chemprop anymore)
I think we should avoid adding descriptastorus as a required dependency, since the repository is rarely updated and has broken on us in the past

My suggested alternatives:

vendor just the one function we want
add our own feature calculator and standardizer

There's also a theoretical question about whether or not using the descriptastorus features is a data leak. My thinking is that the distributions used to rescale the features are parameterized based on some set that I don't know (does anyone?) that could possibly contain overlapping molecules with a testing or validation set, which would shift the distribution of said features to look more like training and thus constitute a data leak.

KnathanM · 2024-06-13T21:38:00Z

Ready for another review.

I removed descriptastorus as a dependency (see below for why).

Can you add unit test and/or CLI test for it?

I added a clunky test, open to improvements on it. Mostly I wanted to make sure that the molecule featurizers don't change how they featurize (e.g. RDKit adds another molecule feature to its list), because Chemprop models from the CLI rely on the featurizers not changing.

I also added a note about using descriptastorus to calculate normalized features to the docs.

why remove descriptastorus

@JacksonBurns had some thoughts about this:

it can be added to environment.yml but not the conda recipe we use to distribute on conda-forge (meaning we would not have conda install chemprop anymore)

Turns out descriptastorus is on conda-forge: https://anaconda.org/conda-forge/descriptastorus

I think we should avoid adding descriptastorus as a required dependency, since the repository is rarely updated and has broken on us in the past

I also think that the previous trouble we had with descriptastorus wasn't that bad. The latest version has an error that makes it incompatible with scipy<1.9, but that version of scipy was released two years ago. The simple solution was to update scipy. This error was fixed two months ago on GitHub, but hasn't been released as a new version.

There's also a theoretical question about whether or not using the descriptastorus features is a data leak. My thinking is that the distributions used to rescale the features are parameterized based on some set that I don't know (does anyone?)

I think this is something to consider. The main reason I chose to not have descriptastorus available from the CLI is that I don't know where their scaling came from. Including it in the CLI is essentially a recommendation to use it as it only requires specifying a flag, which I don't feel I can recommend using if I don't know where it came from. Mostly I want to encourage chemprop users to be aware of what information their model is actually using to make predictions. If they are using descriptors, they should hopefully have some idea of what those descriptors are.

A secondary reason I chose to not use descriptastorus is that it is missing a few of the newer molecular features added to RDKit. See the linked issues for details.

JacksonBurns · 2024-06-14T17:52:49Z

Didn't know it was on conda-forge! That changes my opinion, and I would be OK with it from a software perspective.

I'll note too that the incompatible version of scipy cannot even be installed on the versions of Python that Chemprop v2 supports, see: https://docs.scipy.org/doc/scipy/dev/toolchain.html#numpy

Someone else has also just opened an issue asking about the scaling on the descriptastorus repo: bp-kelley/descriptastorus#31 I'm hoping for a satisfactory answer.

JacksonBurns · 2024-06-17T14:00:01Z

I've started a small test PR on the conda recipe to see if the build will work with descriptastorus.

JacksonBurns · 2024-06-17T14:03:13Z

chemprop/featurizers/molecule.py

+        else:
+            features = np.array([func(mol) for name, func in Descriptors.descList], float)
+
+        NAN_TOKEN = 0


Not sure I see the value in setting this here and then using it immediately versus just adding a comment that says 'set np.nan to zero'

The main value is that it is consistent with how we deal with nans in the extra features/descriptors the user gives when making a data point (

chemprop/chemprop/data/datapoints.py

Line 35 in 06171ac

NAN_TOKEN = 0

). That said, we already take care of this in datapoint, so probably I can remove this nan_to_zero conversion and return an array that potentially has some nans. This is different than what I previously thought, that the featurizers were responsible for not returning nans. Currently, I'm of the opinion we shouldn't hide the nans from the user, except when we need to like when making the datapoints because the model can't take nans.

JacksonBurns · 2024-06-17T14:15:18Z

chemprop/featurizers/molecule.py

+        if mol.GetNumHeavyAtoms() == 0:
+            features = np.array(
+                [0 if name == "SPS" else func(mol) for name, func in Descriptors.descList], float
+            )
+        else:
+            features = np.array([func(mol) for name, func in Descriptors.descList], float)


Suggested change

if mol.GetNumHeavyAtoms() == 0:

features = np.array(

[0 if name == "SPS" else func(mol) for name, func in Descriptors.descList], float

)

else:

features = np.array([func(mol) for name, func in Descriptors.descList], float)

features = np.array(

[func(mol) for _, func in filter(lambda i: mol.GetNumHeavyAtoms() > 0 or i[0] != "SPS", Descriptors.descList)],

dtype=float,

)

I think this is the same and avoids duplicating a line

Thank you for writing that with filter. I switched am putting i[0] != "SPS" first so that we don't need to call GetNumHeavyAtoms most of the time, only when the function name is SPS.

JacksonBurns · 2024-06-17T14:16:31Z

docs/source/tutorial/cli/train.rst



 Molecule-Level 2D Features
 ^^^^^^^^^^^^^^^^^^^^^^^^^^

-Morgan fingerprints can be generated as molecular 2D features using :code:`--features-generators`:
+Chemprop provides several molecule featurizers that automatically calculates molecular features and uses them as extra datapoint descriptors. These are specified using :code:`--features-generators` followed by one or more of the following:


Suggested change

Chemprop provides several molecule featurizers that automatically calculates molecular features and uses them as extra datapoint descriptors. These are specified using :code:`--features-generators` followed by one or more of the following:

Chemprop provides several molecule featurizers that automatically calculate molecular features and uses them as extra datapoint descriptors. These are specified using :code:`--features-generators` followed by one or more of the following:

tests/cli/test_cli_regression_mol.py

chemprop/featurizers/molecule.py

docs/source/tutorial/python/featurizers/molecule_featurizers.ipynb

StefanIsSmart · 2024-06-18T01:29:59Z

Hello, anyone know the answer about bp-kelley/descriptastorus#31 ?

JacksonBurns · 2024-06-18T01:35:49Z

Hello, anyone know the answer about bp-kelley/descriptastorus#31 ?

No, we are also curious.

shihchengli · 2024-08-01T15:02:25Z

chemprop/featurizers/molecule.py

+
+if not NO_DESCRIPTASTORUS:
+
+    class V1RDKit2DFeaturizerMixin(VectorFeaturizer[Mol]):


What is the difference between RDKit2DFeaturizer and V1RDKit2DFeaturizer? Should we just keep one of them to avoid confusion?

It seems the new one has 10 more descriptors. Maybe we can just keep one of them.

'AvgIpc', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', and 'SPS'

If we had to keep just one, I would recommend it being the V1RDKit2DFeaturizer because if someone is trying to use a v1 model that used rdkit features, they would need that featurizer.

I added a new version because of the extra descriptors that have been added since descriptastorus. It could be nice to let people get these more descriptors. My intention is that all new models would use RDKit2DFeaturizer and that the V1 version would only be used for models that were trained in v1.

Let me know if you feel strongly, or we can bring it up in the next meeting.

I am okay with keeping both of them. However, in the long term, I don't think people will care about V1RDKit2DFeaturizer anymore, as it will become a legacy we used in the past. If someone wants to do a benchmark with this smaller set, they should take responsibility for generating those features correctly via a notebook and provide them as additional features.

shihchengli · 2024-08-01T15:34:17Z

chemprop/featurizers/molecule.py

+        def __init__(self):
+            raise ImportError(
+                "Failed to import descriptastorus. Please install descriptastorus "
+                "(https://github.com/bp-kelley/descriptastorus) to use RDKit 2D features."
+            )


I saw that this is a mock implementation. You might already have an idea of how to implement it in a real implementation. Here is my two cents:
Move this into V1RDKit2DFeaturizerMixin and check if NO_DESCRIPTASTORUS is True.

tests/unit/featurizers/test_molecule.py

shihchengli · 2024-08-01T15:51:58Z

It seems we are going to add Descriptastorus as a dependency. Could you modify the pyproject.toml and environment.yml accordingly?

JacksonBurns · 2024-08-04T20:16:42Z

I've started a small test PR on the conda recipe to see if the build will work with descriptastorus.

The linked PR on the conda recipe confirmed that the build will work.

KnathanM · 2024-08-24T01:28:51Z

@shihchengli Am I missing any changes you'd like to see in this PR before merging?

chemprop/featurizers/molecule.py

shihchengli

Thanks for improving this PR along the way. These are my final comments on this PR. After considering and addressing them, I would be happy to merge it.

docs/source/tutorial/cli/train.rst

docs/source/tutorial/python/featurizers/molecule_featurizers.ipynb

SPS is a measure of the complexity of the molecule that depends on the number of heavy atoms Co-authored-by: Shih-Cheng Li <scli@mit.edu>

Co-authored-by: Shih-Cheng Li <scli@mit.edu>

JacksonBurns

This looks great and is ready to merge, I just want to clear something up for future development purposes.

One overall question - we allow for the import of descriptastorus to fail (i.e. allowing for it to not be installed) but we have it in the PyPI and Conda recipe (i.e. it will always be installed). Why? My initial thinking is that we would want to not make it required or not allow it to be missing.

KnathanM · 2024-08-27T14:38:56Z

That's a good point. I think we are okay requiring descriptastorus because it available on PyPI and Conda. Let's see if the tests work when it is required. Current users will need to install descriptastorus themselves when they pull in these changes, but that shouldn't be too hard?

JacksonBurns · 2024-08-27T14:44:42Z

Agreed, should be straightforward. Approving!

shihchengli reviewed May 23, 2024

View reviewed changes

kevingreenman added this to the v2.0.1 milestone May 23, 2024

KnathanM modified the milestones: v2.0.1, v2.0.x Jun 4, 2024

JacksonBurns requested changes Jun 4, 2024

View reviewed changes

KnathanM added 2 commits June 13, 2024 11:05

add rdkit_2d featurizer

b677fac

Add test and docs for rdkit features

a620924

KnathanM force-pushed the rdkit_2d_features branch from e57d674 to a620924 Compare June 13, 2024 20:25

improve docs for molecule featurizers

88408c0

JacksonBurns mentioned this pull request Jun 17, 2024

add descriptastorus to the recipe conda-forge/chemprop-feedstock#25

Closed

JacksonBurns reviewed Jun 17, 2024

View reviewed changes

shihchengli reviewed Jun 17, 2024

View reviewed changes

chemprop/featurizers/molecule.py Outdated Show resolved Hide resolved

docs/source/tutorial/python/featurizers/molecule_featurizers.ipynb Outdated Show resolved Hide resolved

KnathanM mentioned this pull request Jun 17, 2024

[v2 better design?]: Function override can cause unexpected and unclear featurization behavior #921

Open

KnathanM added 7 commits July 29, 2024 12:59

address comments

64d4c2f

allow featurizer to return nans

1f958ff

add v1 rdkit molecule featurizers

ebef8e4

Merge branch 'main' into rdkit_2d_features

965e461

fix merge errors

def761f

update documentation for rdkit descriptors

f45d914

lint

f83dadc

shihchengli reviewed Aug 1, 2024

View reviewed changes

tests/unit/featurizers/test_molecule.py Show resolved Hide resolved

KnathanM added 2 commits August 22, 2024 14:31

address comments

7b0b8a7

Merge branch 'main' into rdkit_2d_features

e24cf6d

shihchengli reviewed Aug 24, 2024

View reviewed changes

chemprop/featurizers/molecule.py Outdated Show resolved Hide resolved

shihchengli reviewed Aug 24, 2024

View reviewed changes

KnathanM and others added 3 commits August 24, 2024 15:59

fix missing feature when n_heavyatoms = 0

5b80ec7

SPS is a measure of the complexity of the molecule that depends on the number of heavy atoms Co-authored-by: Shih-Cheng Li <scli@mit.edu>

Apply suggestions from code review

44900b0

Co-authored-by: Shih-Cheng Li <scli@mit.edu>

fix typo

f3769cd

KnathanM requested a review from JacksonBurns August 24, 2024 20:24

Merge branch 'main' into rdkit_2d_features

c5b89ca

shihchengli approved these changes Aug 25, 2024

View reviewed changes

JacksonBurns reviewed Aug 27, 2024

View reviewed changes

KnathanM added 2 commits August 27, 2024 10:35

require descriptastorus

7e5208a

remove descriptastorus skip in tests

870e9c0

Merge branch 'main' into rdkit_2d_features

3717cec

JacksonBurns self-requested a review August 27, 2024 14:44

JacksonBurns approved these changes Aug 27, 2024

View reviewed changes

JacksonBurns enabled auto-merge (squash) August 27, 2024 14:45

JacksonBurns merged commit d34f83f into chemprop:main Aug 27, 2024
13 checks passed

KnathanM mentioned this pull request Aug 27, 2024

[v2 Feature Request]: automated rdkit2d features #843

Closed

KnathanM deleted the rdkit_2d_features branch October 7, 2024 17:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add rdkit_2d featurizer #877

add rdkit_2d featurizer #877

KnathanM commented May 22, 2024

shihchengli left a comment

shihchengli May 23, 2024

hwpang May 23, 2024

shihchengli May 24, 2024

KnathanM May 24, 2024

shihchengli May 24, 2024

KnathanM commented May 24, 2024

hwpang commented May 24, 2024 •

edited

Loading

JacksonBurns left a comment

JacksonBurns commented Jun 4, 2024

KnathanM commented Jun 13, 2024

JacksonBurns commented Jun 14, 2024

JacksonBurns commented Jun 17, 2024

JacksonBurns Jun 17, 2024

KnathanM Jul 29, 2024

JacksonBurns Jun 17, 2024

KnathanM Jul 29, 2024

JacksonBurns Jun 17, 2024

StefanIsSmart commented Jun 18, 2024

JacksonBurns commented Jun 18, 2024

shihchengli Aug 1, 2024

shihchengli Aug 1, 2024

KnathanM Aug 22, 2024

shihchengli Aug 23, 2024

shihchengli Aug 1, 2024

KnathanM Aug 22, 2024

shihchengli commented Aug 1, 2024 •

edited

Loading

JacksonBurns commented Aug 4, 2024

KnathanM commented Aug 24, 2024

shihchengli left a comment

JacksonBurns left a comment

KnathanM commented Aug 27, 2024

JacksonBurns commented Aug 27, 2024

	Chemprop provides several molecule featurizers that automatically calculates molecular features and uses them as extra datapoint descriptors. These are specified using :code:`--features-generators` followed by one or more of the following:
	Chemprop provides several molecule featurizers that automatically calculate molecular features and uses them as extra datapoint descriptors. These are specified using :code:`--features-generators` followed by one or more of the following:


		if not NO_DESCRIPTASTORUS:

		class V1RDKit2DFeaturizerMixin(VectorFeaturizer[Mol]):

add rdkit_2d featurizer #877

add rdkit_2d featurizer #877

Conversation

KnathanM commented May 22, 2024

Description

Questions

Checklist

shihchengli left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KnathanM commented May 24, 2024

hwpang commented May 24, 2024 • edited Loading

JacksonBurns left a comment

Choose a reason for hiding this comment

JacksonBurns commented Jun 4, 2024

KnathanM commented Jun 13, 2024

why remove descriptastorus

JacksonBurns commented Jun 14, 2024

JacksonBurns commented Jun 17, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StefanIsSmart commented Jun 18, 2024

JacksonBurns commented Jun 18, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shihchengli commented Aug 1, 2024 • edited Loading

JacksonBurns commented Aug 4, 2024

KnathanM commented Aug 24, 2024

shihchengli left a comment

Choose a reason for hiding this comment

JacksonBurns left a comment

Choose a reason for hiding this comment

KnathanM commented Aug 27, 2024

JacksonBurns commented Aug 27, 2024

hwpang commented May 24, 2024 •

edited

Loading

shihchengli commented Aug 1, 2024 •

edited

Loading