
MNT Use float64_t and intp_t directly in Cython for _pairwise_distances_reduction #24153

Closed
wants to merge 11 commits

Conversation

thomasjpfan
Member

Reference Issues/PRs

Related to #23865

What does this implement/fix? Explain your changes.

As noted in #23865 (comment), I agree that using DTYPE_t and ITYPE_t is a bit confusing and that using float64_t and intp_t is easier to understand.
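A minimal sketch of the kind of change proposed (the function name `fill_zeros` is hypothetical, not from the PR): instead of the project-wide aliases `DTYPE_t`/`ITYPE_t`, the concrete NumPy-backed types are spelled out directly.

```cython
cimport numpy as cnp

# Before (roughly): cdef void fill_zeros(DTYPE_t[:] out, ITYPE_t n)
# After: the underlying types are visible at the call site.
cdef void fill_zeros(cnp.float64_t[:] out, cnp.intp_t n):
    cdef cnp.intp_t i
    for i in range(n):
        out[i] = 0.0
```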

Member

@jjerphan jjerphan left a comment


Thanks for starting this, @thomasjpfan.

In #23865 (comment), you mentioned that we could even get rid of the NumPy type declarations and use the concrete types directly.

Which do you prefer?

Also, this PR modifies _hashing_fast.pyx, which is out of the scope of _pairwise_distances_reduction. I am fine with modifying other Cython sources for consistency, either in this PR or in smaller ones.

What do you think?

Member

@ogrisel ogrisel left a comment


I think I would be in favor of smaller independent PRs:

  • one for the metadata move (it seems quite unrelated to the Cython types, doesn't it?)
  • and one PR per independent submodule with Cython code.

```diff
@@ -29,7 +29,7 @@
     "scipy": (SCIPY_MIN_VERSION, "build, install"),
     "joblib": (JOBLIB_MIN_VERSION, "install"),
     "threadpoolctl": (THREADPOOLCTL_MIN_VERSION, "install"),
-    "cython": (CYTHON_MIN_VERSION, "build"),
+    "Cython": (CYTHON_MIN_VERSION, "build"),
```
Member


Question: does the casing impact a build step? Is this related to the use of pyproject.toml / setuptools.build_meta?

Member Author

@thomasjpfan thomasjpfan Aug 10, 2022


Pushing the pyproject.toml commit was a mistake.

I was experimenting with moving dependencies to pyproject.toml and it ended up in this branch.

Member Author


As for the question: the casing itself is not required. I had a strict check to make sure pyproject.toml was consistent with _min_dependencies.py, and that check was case-sensitive.

@thomasjpfan
Member Author

Also, this PR modifies _hashing_fast.pyx

There is a dependency chain that reaches _hashing_fast: updating _pairwise_distances_reduction requires updating _vector_sentinel, which in turn means modifying _hashing_fast.

@thomasjpfan
Member Author

thomasjpfan commented Aug 10, 2022

In #23865 (comment), you mentioned that we could even get rid of the NumPy type declarations and use the concrete types directly.

Although I am in favor of using float and double, I suspect cnp.float32_t and cnp.float64_t are easier to understand. When we call NumPy's Python API with dtype=np.float{32,64}, there is a direct one-to-one correspondence with cnp.float{32,64}_t.
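A small sketch of that one-to-one correspondence (the function `zeros64` is a hypothetical illustration, not code from the PR):

```cython
cimport numpy as cnp
import numpy as np

def zeros64(cnp.intp_t n):
    # dtype=np.float64 on the Python side maps directly to
    # cnp.float64_t in the typed memoryview on the Cython side.
    cdef cnp.float64_t[:] out = np.zeros(n, dtype=np.float64)
    return np.asarray(out)
```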

For this PR, we also need to replace np.intp with Py_intptr_t:

https://github.com/numpy/numpy/blob/bb95cf015165bcf85f99ba0d28a7a6e727ca955b/numpy/__init__.pxd#L24
https://github.com/numpy/numpy/blob/a642e1f0437571be82966069484dd359732c852b/numpy/__init__.pxd#L721

@jjerphan Do you have an opinion on {float,double} vs {cnp.float32, cnp.float64}?

@thomasjpfan
Member Author

Looking at the discussion in #23865 (comment), I'm okay with going with cnp.float{32, 64}.

@jjerphan
Member

I confirm that I find the NumPy aliases (i.e. cnp.{float32,float64}_t and np.{float32,float64}) nicer to work with, as they are explicit.

@jeremiedbb
Member

jeremiedbb commented Aug 31, 2022

I'm also strongly in favor of using cnp.float32/64_t. It reduces confusion a lot.

@Micky774
Contributor

For this PR, we also need to replace np.intp with Py_intptr_t:

This is far more pedantic than we really need to be, but per PEP 353 I believe the canonical choice is Py_ssize_t. On the NumPy side of things, everything is compatible with Py_ssize_t afaik.

Relevant excerpt from PEP 353:

All occurrences of index and length parameters and results are changed to use Py_ssize_t, including the sequence slots in type objects, and the buffer interface.
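A minimal sketch of Py_ssize_t as the index type (the helper `argmax` is hypothetical; in Cython, memoryview shapes are already Py_ssize_t, so the loop index matches them exactly):

```cython
cdef Py_ssize_t argmax(double[:] x):
    # Py_ssize_t is the canonical index/length type per PEP 353;
    # x.shape[0] is itself a Py_ssize_t.
    cdef Py_ssize_t i, best = 0
    for i in range(1, x.shape[0]):
        if x[i] > x[best]:
            best = i
    return best
```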

Member

@jeremiedbb jeremiedbb left a comment


LGTM.

I don't have a strong opinion between cnp.intp_t and Py_ssize_t for looping through arrays. Both are platform-dependent, which I'm not very enthusiastic about. We could also use cnp.int32/64_t to be fully explicit and not platform-dependent.
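A hypothetical sketch of the trade-off being discussed: the index type varies with the platform, while the data types are fixed-width and explicit.

```cython
cimport numpy as cnp

cdef cnp.int64_t checksum(cnp.int64_t[:] values):
    # cnp.int64_t: fixed width on every platform, fully explicit.
    cdef cnp.int64_t total = 0
    # Py_ssize_t (like cnp.intp_t): platform-dependent index type,
    # 32-bit on 32-bit builds and 64-bit on 64-bit builds.
    cdef Py_ssize_t i
    for i in range(values.shape[0]):
        total += values[i]
    return total
```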

@jjerphan
Member

jjerphan commented Oct 7, 2022

Thinking about it again, I think having consistency and uniformity for types and fused types definitely makes sense. Yet I am not entirely sure that directly using concrete types is the best option, because we sometimes want types to carry some semantics.

I think that there are various usages and aliases for types and fused types (some of which might be relevant) in the code base and that it would be appropriate to perform a comprehensive inspection of those. What do you think?

(My previous remark regarding using types aliases from np and cnp rather than C concrete types directly (#24153 (comment)) still holds.)

@Micky774
Contributor

Micky774 commented Oct 25, 2022

I think that there are various usages and aliases for types and fused types (some of which might be relevant) in the code base and that it would be appropriate to perform a comprehensive inspection of those. What do you think?

Honestly, the approach of library-wide aliases has been more frustrating than helpful to me personally in the past. One example is our ongoing effort to support sparse index dtypes, e.g. #23731. I am inclined to prefer explicit typing by default, with file-specific aliasing as needed.

This is especially important in algorithms with more nuanced fused-type behavior -- for example, what if two arrays have the same type semantics, but we need to generate the cross-product of their fused types? Something like some_func(SPARSE_IDX[:] A, SPARSE_IDX[:] B) demands that A and B specialize to the exact same type; if we want a full cross-product, we need to explicitly define redundant fused types, e.g. SPARSE_IDX_A and SPARSE_IDX_B. This means that we would then have:

  • global scikit-learn defined types (SPARSE_IDX)
  • transient types for code generation (SPARSE_IDX_{A,B})
  • potentially special-case fused-types for niche purposes (SPARSE_IDX_UINT32)

In such cases, the presence of the globally defined type is more obstructive than helpful in my opinion. It would also pose a barrier to entry for new Cython developers, who may have difficulty differentiating the semantics of global aliases from the transient aliases that Cython sometimes demands for code generation.

I'm not super opposed to this; however, I have definitely felt the friction due to it before.
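The cross-product limitation described above can be sketched as follows (SPARSE_IDX and cross_merge are hypothetical names used for illustration; Cython specializes each fused type independently, so one shared fused type cannot produce mixed-type combinations):

```cython
cimport numpy as cnp

ctypedef fused SPARSE_IDX:          # hypothetical global alias
    cnp.int32_t
    cnp.int64_t

# With one fused type, A and B are forced to the SAME specialization:
# cdef some_func(SPARSE_IDX[:] A, SPARSE_IDX[:] B): ...

# To generate all four int32/int64 combinations, two structurally
# identical "transient" fused types are required:
ctypedef fused SPARSE_IDX_A:
    cnp.int32_t
    cnp.int64_t

ctypedef fused SPARSE_IDX_B:
    cnp.int32_t
    cnp.int64_t

cdef Py_ssize_t cross_merge(SPARSE_IDX_A[:] A, SPARSE_IDX_B[:] B):
    # A and B may now specialize independently (e.g. int32 x int64).
    return A.shape[0] + B.shape[0]
```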

@Vincent-Maladiere
Contributor

As a new Cython developer, I initially found navigating between types and fused types a bit tedious. In #24556, I would have preferred using cnp.{float32,float64}_t and np.{float32,float64} directly, although I'm probably not aware of the semantic gains that aliases could yield.

So I definitely share the opinion that limiting the number of type-definition levels would help onboard new contributors.

@lorentzenchr
Member

Let us discuss in #25572.

@jjerphan
Member

I think this PR can be closed, since the solution agreed upon in #25572 differs from this PR's proposal. What do you think, @thomasjpfan?

@jeremiedbb
Member

Note that #25810 supersedes this PR with the solution decided in #25572.

@thomasjpfan
Member Author

Yup, let's close this PR.
