Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add rdkit_2d featurizer #877

Merged
merged 19 commits into from
Aug 27, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
lint
  • Loading branch information
KnathanM committed Jul 29, 2024
commit f83dadc5df33269bc7ec78be95ea80856115eb28
2 changes: 2 additions & 0 deletions chemprop/featurizers/molecule.py
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,7 @@ def __call__(self, mol: Chem.Mol) -> np.ndarray:


if not NO_DESCRIPTASTORUS:

class V1RDKit2DFeaturizerMixin(VectorFeaturizer[Mol]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between RDKit2DFeaturizer and V1RDKit2DFeaturizer? Should we just keep one of them to avoid confusion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the new one has 10 more descriptors. Maybe we can just keep one of them.

'AvgIpc', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', and 'SPS'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had to keep just one, I would recommend it being the V1RDKit2DFeaturizer because if someone is trying to use a v1 model that used rdkit features, they would need that featurizer.

I added a new version because of the extra descriptors that have been added since descriptastorus. It could be nice to let people get these more descriptors. My intention is that all new models would use RDKit2DFeaturizer and that the V1 version would only be used for models that were trained in v1.

Let me know if you feel strongly, or we can bring it up in the next meeting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am okay with keeping both of them. However, in the long term, I don't think people will care about V1RDKit2DFeaturizer anymore, as it will become a legacy we used in the past. If someone wants to do a benchmark with this smaller set, they should take responsibility for generating those features correctly via a notebook and provide them as additional features.

def __len__(self) -> int:
return 200
Expand All @@ -99,6 +100,7 @@ def __init__(self):
self.generator = rdNormalizedDescriptors.RDKit2DNormalized()

else:

@MoleculeFeaturizerRegistry("v1_rdkit_2d")
class V1RDKit2DFeaturizer:
"""Mock implementation raising an ImportError if descriptastorus cannot be imported."""
Expand Down
9 changes: 7 additions & 2 deletions tests/unit/featurizers/test_molecule.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,12 @@
import pytest
from rdkit import Chem

from chemprop.featurizers import MorganBinaryFeaturizer, MorganCountFeaturizer, RDKit2DFeaturizer, V1RDKit2DFeaturizer
from chemprop.featurizers import (
MorganBinaryFeaturizer,
MorganCountFeaturizer,
RDKit2DFeaturizer,
V1RDKit2DFeaturizer,
)
from chemprop.featurizers.molecule import NO_DESCRIPTASTORUS


Expand Down Expand Up @@ -183,4 +188,4 @@ def test_v1_rdkit_2d(mol, v1_rdkit_2d_values):
featurizer = V1RDKit2DFeaturizer()
features = featurizer(mol)

np.testing.assert_array_almost_equal(features, v1_rdkit_2d_values, decimal=2)
np.testing.assert_array_almost_equal(features, v1_rdkit_2d_values, decimal=2)
Loading