-
-
Notifications
You must be signed in to change notification settings - Fork 25.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MNT Use float64_t and intp_t directly in Cython for _pairwise_distances_reduction #24153
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for starting this, @thomasjpfan.
In #23865 (comment), you mentioned that we could even get ride of NumPy declaration of types and use the concrete types directly.
Which do you prefer?
Also, this PR modifies _hashing_fast.pyx
, which is out of the scope of _pairwise_distances_reduction
. I am fine modifying other Cython sources for consistency in this PR or in smaller ones.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would be in favor of smaller independent PRs:
- one for the metadata move (it seems quite unrelated to the Cython types, isn't it?)
- and one PR per independent submodule with Cython code.
sklearn/_min_dependencies.py
Outdated
@@ -29,7 +29,7 @@ | |||
"scipy": (SCIPY_MIN_VERSION, "build, install"), | |||
"joblib": (JOBLIB_MIN_VERSION, "install"), | |||
"threadpoolctl": (THREADPOOLCTL_MIN_VERSION, "install"), | |||
"cython": (CYTHON_MIN_VERSION, "build"), | |||
"Cython": (CYTHON_MIN_VERSION, "build"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: is the casing impacting a build step? Is this related to the use of pyproject.toml
/ setuptools.build_meta
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushing the pyproject.toml commit was a mistake.
I was experimenting with moving dependencies to pyproject.toml
and it ended up in this branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the question, the case is not required. I had a strict check to make sure pyproject.toml
was consistent with _min_dependencies.py
which was case-sensitive.
This reverts commit f9bb33e.
There was a dependency chain that gets to |
Although, I am in favor of using For this PR, we also need to replace https://github.com/numpy/numpy/blob/bb95cf015165bcf85f99ba0d28a7a6e727ca955b/numpy/__init__.pxd#L24 @jjerphan Do you have an opinion on |
Looking at the discussion in #23865 (comment), I'm okay with going with |
I confirm that I find numpy aliases (i.e. |
I'm also strongly in favor of using cnp.float32/64_t. It reduces confusion a lot |
This is far more pedantic than we really need to care about, but by PEP 0353 I believe the canonical choice is Relevant from PEP 0353:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I don't have a strong opinion between np.intp_t
or Py_ssize_t
to loop through arrays. Both are platform dependent which I'm not very enthusiastic about. We could also use np.int32/64_t
to not be platform dependent and fully explicit.
Thinking again, I think having consistency and uniformity for types and fused types definitely makes sense. Yet, I am not entirely sure directly using concrete types is the best option because we sometimes might want to have some semantic for types. I think that there are various usages and aliases for types and fused types (some of which might be relevant) in the code base and that it would be appropriate to perform a comprehensive inspection of those. What do you think? (My previous remark regarding using types aliases from |
Honestly, the approach of library-wide aliases has been more frustrating than helpful to me personally in the past. One example is as we're trying to figure out how to support things such as sparse indices dtypes, e.g. #23731. I feel inclined to prefer explicit typing by default, and file-specific aliasing as needed. This is especially important in algorithms with more nuanced fused-type behavior -- for example, what if two arrays both have the same type semantics, but we need to generate the cross-product of their fused types? Something like
In such cases, the presence of the global defined type is more obstructive than helpful in my opinion. This would also pose a barrier to entry to new Cython developers that may have difficulty differentiating the semantics of global aliases vs the transient aliases that Cython sometimes demands for code generation. I'm not super opposed to this, however I have definitely felt the friction due to it before. |
As a new Cython developer, I found navigating between types and fuse types a bit tedious initially. In #24556, I would have preferred using So definitely sharing the opinion that limiting the number of types definition levels would help onboarding new users. |
Let us discuss in #25572. |
I think this PR can be closed since the agreed solution discussed in #25572 is different from the proposal of this PR. What do you think, @thomasjpfan? |
Yup, let's close this PR. |
Reference Issues/PRs
Related to #23865
What does this implement/fix? Explain your changes.
As noted in #23865 (comment), I agree that using
DTYPE_t
andITYPE_t
is a bit confusing and that usingfloat64_t
andintp_t
is easier to understand.