
MNT Use float64_t and intp_t directly in Cython for _pairwise_distances_reduction #24153

Closed
wants to merge 11 commits

Conversation

thomasjpfan
Member

Reference Issues/PRs

Related to #23865

What does this implement/fix? Explain your changes.

As noted in #23865 (comment), I agree that using DTYPE_t and ITYPE_t is a bit confusing and that using float64_t and intp_t is easier to understand.
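A minimal sketch of the kind of change proposed (the function name `fill_zeros` is hypothetical, not from the PR): instead of the project-wide aliases `DTYPE_t`/`ITYPE_t`, the concrete NumPy-backed types are spelled out directly.

```cython
cimport numpy as cnp

# Before (roughly): cdef void fill_zeros(DTYPE_t[:] out, ITYPE_t n)
# After: the underlying types are visible at the call site.
cdef void fill_zeros(cnp.float64_t[:] out, cnp.intp_t n):
    cdef cnp.intp_t i
    for i in range(n):
        out[i] = 0.0
```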

Member

@jjerphan jjerphan left a comment


Thanks for starting this, @thomasjpfan.

In #23865 (comment), you mentioned that we could even get rid of the NumPy type declarations and use the concrete types directly.

Which do you prefer?

Also, this PR modifies _hashing_fast.pyx, which is out of the scope of _pairwise_distances_reduction. I am fine with modifying other Cython sources for consistency, either in this PR or in smaller ones.

What do you think?

Member

@ogrisel ogrisel left a comment


I think I would be in favor of smaller independent PRs:

  • one for the metadata move (it seems quite unrelated to the Cython types, doesn't it?)
  • and one PR per independent submodule with Cython code.

```diff
@@ -29,7 +29,7 @@
     "scipy": (SCIPY_MIN_VERSION, "build, install"),
     "joblib": (JOBLIB_MIN_VERSION, "install"),
     "threadpoolctl": (THREADPOOLCTL_MIN_VERSION, "install"),
-    "cython": (CYTHON_MIN_VERSION, "build"),
+    "Cython": (CYTHON_MIN_VERSION, "build"),
```
Member


Question: does the casing impact a build step? Is this related to the use of pyproject.toml / setuptools.build_meta?

Member Author

@thomasjpfan thomasjpfan Aug 10, 2022


Pushing the pyproject.toml commit was a mistake.

I was experimenting with moving dependencies to pyproject.toml and it ended up in this branch.

Member Author


As for the question: the casing itself is not required. I had a strict check to make sure pyproject.toml was consistent with _min_dependencies.py, and that check was case-sensitive.

@thomasjpfan
Member Author

Also, this PR modifies _hashing_fast.pyx

There is a dependency chain that reaches _hashing_fast: updating _pairwise_distances_reduction requires updating _vector_sentinel, which in turn means modifying _hashing_fast.

@thomasjpfan
Member Author

thomasjpfan commented Aug 10, 2022

In #23865 (comment), you mentioned that we could even get rid of the NumPy type declarations and use the concrete types directly.

Although I am in favor of using float and double, I suspect cnp.float32_t and cnp.float64_t are easier to understand. When we call NumPy's Python API with dtype=np.float{32,64}, there is a direct one-to-one correspondence with cnp.float{32,64}_t.
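A small sketch of that one-to-one correspondence (the function `zeros64` is a hypothetical illustration, not code from the PR):

```cython
cimport numpy as cnp
import numpy as np

def zeros64(cnp.intp_t n):
    # dtype=np.float64 on the Python side maps directly to
    # cnp.float64_t in the typed memoryview on the Cython side.
    cdef cnp.float64_t[:] out = np.zeros(n, dtype=np.float64)
    return np.asarray(out)
```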

For this PR, we also need to replace np.intp with Py_intptr_t:

https://github.com/numpy/numpy/blob/bb95cf015165bcf85f99ba0d28a7a6e727ca955b/numpy/__init__.pxd#L24
https://github.com/numpy/numpy/blob/a642e1f0437571be82966069484dd359732c852b/numpy/__init__.pxd#L721

@jjerphan Do you have an opinion on {float,double} vs {cnp.float32, cnp.float64}?

@thomasjpfan
Member Author

Looking at the discussion in #23865 (comment), I'm okay with going with cnp.float{32, 64}.

@jjerphan
Member

I confirm that I find the NumPy aliases (i.e. cnp.{float32,float64}_t and np.{float32,float64}) nicer to work with, as they are explicit.

@jeremiedbb
Member

jeremiedbb commented Aug 31, 2022

I'm also strongly in favor of using cnp.float32/64_t. It reduces confusion a lot.

@Micky774
Contributor

For this PR, we also need to replace np.intp with Py_intptr_t:

This is far more pedantic than we really need to be, but per PEP 353 I believe the canonical choice is Py_ssize_t. On the NumPy side of things, everything is compatible with Py_ssize_t afaik.

Relevant excerpt from PEP 353:

All occurrences of index and length parameters and results are changed to use Py_ssize_t, including the sequence slots in type objects, and the buffer interface.
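A minimal sketch of Py_ssize_t as the index type (the helper `argmax` is hypothetical; in Cython, memoryview shapes are already Py_ssize_t, so the loop index matches them exactly):

```cython
cdef Py_ssize_t argmax(double[:] x):
    # Py_ssize_t is the canonical index/length type per PEP 353;
    # x.shape[0] is itself a Py_ssize_t.
    cdef Py_ssize_t i, best = 0
    for i in range(1, x.shape[0]):
        if x[i] > x[best]:
            best = i
    return best
```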

Member

@jeremiedbb jeremiedbb left a comment


LGTM.

I don't have a strong opinion between cnp.intp_t and Py_ssize_t for looping through arrays. Both are platform-dependent, which I'm not very enthusiastic about. We could also use cnp.int32/64_t to be fully explicit and not platform-dependent.
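A hypothetical sketch of the trade-off being discussed: the index type varies with the platform, while the data types are fixed-width and explicit.

```cython
cimport numpy as cnp

cdef cnp.int64_t checksum(cnp.int64_t[:] values):
    # cnp.int64_t: fixed width on every platform, fully explicit.
    cdef cnp.int64_t total = 0
    # Py_ssize_t (like cnp.intp_t): platform-dependent index type,
    # 32-bit on 32-bit builds and 64-bit on 64-bit builds.
    cdef Py_ssize_t i
    for i in range(values.shape[0]):
        total += values[i]
    return total
```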

@jjerphan
Member

jjerphan commented Oct 7, 2022

Thinking about it again, I think having consistency and uniformity for types and fused types definitely makes sense. Yet I am not entirely sure that directly using concrete types is the best option, because we sometimes want types to carry some semantics.

I think that there are various usages and aliases for types and fused types (some of which might be relevant) in the code base and that it would be appropriate to perform a comprehensive inspection of those. What do you think?

(My previous remark regarding using types aliases from np and cnp rather than C concrete types directly (#24153 (comment)) still holds.)

@Micky774
Contributor

Micky774 commented Oct 25, 2022

I think that there are various usages and aliases for types and fused types (some of which might be relevant) in the code base and that it would be appropriate to perform a comprehensive inspection of those. What do you think?

Honestly, the approach of library-wide aliases has been more frustrating than helpful to me personally in the past. One example is our ongoing effort to support sparse index dtypes, e.g. #23731. I am inclined to prefer explicit typing by default, with file-specific aliasing as needed.

This is especially important in algorithms with more nuanced fused-type behavior -- for example, what if two arrays have the same type semantics, but we need to generate the cross-product of their fused types? Something like some_func(SPARSE_IDX[:] A, SPARSE_IDX[:] B) demands that A and B specialize to the exact same type; if we want a full cross-product, we need to explicitly define redundant fused types, e.g. SPARSE_IDX_A and SPARSE_IDX_B. This means that we would then have:

  • global scikit-learn defined types (SPARSE_IDX)
  • transient types for code generation (SPARSE_IDX_{A,B})
  • potentially special-case fused-types for niche purposes (SPARSE_IDX_UINT32)

In such cases, the presence of the globally defined type is more obstructive than helpful in my opinion. It would also pose a barrier to entry for new Cython developers, who may have difficulty differentiating the semantics of global aliases from the transient aliases that Cython sometimes demands for code generation.

I'm not super opposed to this; however, I have definitely felt the friction due to it before.
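The cross-product limitation described above can be sketched as follows (SPARSE_IDX and cross_merge are hypothetical names used for illustration; Cython specializes each fused type independently, so one shared fused type cannot produce mixed-type combinations):

```cython
cimport numpy as cnp

ctypedef fused SPARSE_IDX:          # hypothetical global alias
    cnp.int32_t
    cnp.int64_t

# With one fused type, A and B are forced to the SAME specialization:
# cdef some_func(SPARSE_IDX[:] A, SPARSE_IDX[:] B): ...

# To generate all four int32/int64 combinations, two structurally
# identical "transient" fused types are required:
ctypedef fused SPARSE_IDX_A:
    cnp.int32_t
    cnp.int64_t

ctypedef fused SPARSE_IDX_B:
    cnp.int32_t
    cnp.int64_t

cdef Py_ssize_t cross_merge(SPARSE_IDX_A[:] A, SPARSE_IDX_B[:] B):
    # A and B may now specialize independently (e.g. int32 x int64).
    return A.shape[0] + B.shape[0]
```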

@Vincent-Maladiere
Contributor

As a new Cython developer, I initially found navigating between types and fused types a bit tedious. In #24556, I would have preferred using cnp.{float32,float64}_t and np.{float32,float64} directly, although I'm probably not aware of the semantic gains that aliases could yield.

So I definitely share the opinion that limiting the number of type-definition levels would help onboard new contributors.

@lorentzenchr
Member

Let us discuss in #25572.

@jjerphan
Member

I think this PR can be closed, since the solution agreed upon in #25572 differs from this PR's proposal. What do you think, @thomasjpfan?

@jeremiedbb
Member

Note that #25810 supersedes this PR with the solution decided in #25572.

@thomasjpfan
Member Author

Yup, let's close this PR.
