gh-111545: Add PyHash_Double() function #112095
Conversation
vstinner commented Nov 15, 2023 • edited by github-actions bot
vstinner commented Nov 15, 2023
numpy has a

The proposed API has a single argument and cannot be used as a drop-in replacement for: `Py_hash_t hash = PyHash_Double(value); if (hash == -1) { return _Py_HashPointer(obj); } return hash;`

Problem: Since numpy already has a

Python now uses the identity for the hash when the value is a NaN, see gh-87641. In Python 3.9,

By the way, in Python 3.13,
vstinner commented Nov 15, 2023
Another slice of Python history. In Python 3.2,
vstinner commented Nov 15, 2023
I'm not a fan of hash values being signed numbers. For example, I prefer to avoid them when using the modulo operator (

The signed
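The modulo concern can be sketched in Python (`c_mod` below is a hypothetical helper mimicking C semantics, not CPython code):

```python
# C's `%` truncates toward zero, so reducing a negative hash by the table
# size yields a negative (unusable) bucket index, while Python's `%`
# floors and always stays non-negative for a positive modulus.
def c_mod(a, b):
    """Mimic the C99 truncated-division remainder for ints."""
    q = -((-a) // b) if (a < 0) != (b < 0) else a // b  # truncate toward 0
    return a - q * b

c_index = c_mod(-7, 8)   # negative: unusable as an array index in C
py_index = (-7) % 8      # non-negative: always in range(8)
```

In C code, a signed hash therefore needs an extra mask or cast before it can index a bucket array.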
vstinner commented Nov 15, 2023 • edited
Draft PR gh-112096. I prefer to only start with
vstinner commented Nov 15, 2023
I merged a first change to make this PR smaller and so easier to review. The PR #112098 added documentation and tests on the PyHash_GetFuncDef() function which was added by PEP 456.
vstinner commented Nov 15, 2023
The PR adds

If needed, a second function can be added: `Py_hash_t PyHash_DoubleOrPointer(double value, const void *ptr)` => compute the value hash, or compute the ptr hash if value is not-a-number (NaN).

For me, it's surprising that when passing a Python object in

Or do you prefer to just expose
skirpichev commented Nov 15, 2023
I think this attribute should be deprecated (or just removed?).
Doc/c-api/hash.rst Outdated

Hash a C double number.
Return ``-1`` if *value* is not-a-number (NaN).
Maybe you should document return value for inf too? This is exposed in sys.hash_info.
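The infinity hash is indeed already visible from Python; a quick check of what `sys.hash_info` exposes:

```python
import sys

# hash() of infinity is a fixed, documented constant, sys.hash_info.inf
# (314159, a nod to pi), and -inf hashes to its negation. Unlike NaN,
# no identity fallback is needed for these values.
pos = hash(float("inf"))
neg = hash(float("-inf"))
```

Documenting these in Doc/c-api/hash.rst would only need the `_PyHASH_INF` constant to be public first, as noted in the reply below.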
vstinner Nov 15, 2023 • edited
I prefer to wait until the _PyHASH_INF constant is added to the API. That's the C API documentation, not the Python documentation.
Doc/c-api/hash.rst Outdated

Functions
^^^^^^^^^

.. c:function:: Py_hash_t PyHash_Double(double value)
I would prefer to expose this as unstable API. Hashing of numeric types is a relatively low-level implementation detail, which has been changed in the past in minor (3.x.0) releases (NaN hashing was the most recent). Why not keep this freedom in the future?
I don't expect the PyHash_Double() API to change in the future. The result of the function can change in Python 3.x.0 releases, but I don't consider that this qualifies the function for the PyUnstable API.

The PyUnstable API is more for cases where there is a risk that the function can be removed, that its API can change, or that a major change can happen in a Python 3.x.0 release.
In Python 3.2 (2010), _Py_HashDouble() was written in commit dc787d2 of issue gh-52435.
commit dc787d2055a7b562b64ca91b8f1af6d49fa39f1c
Author: Mark Dickinson <dickinsm@gmail.com>
Date:   Sun May 23 13:33:13 2010 +0000

    Issue #8188: Introduce a new scheme for computing hashes of numbers
    (instances of int, float, complex, decimal.Decimal and
    fractions.Fraction) that makes it easy to maintain the invariant that
    hash(x) == hash(y) whenever x and y have equal value.

As written above, the latest major change was in Python 3.10 to treat NaN differently.
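The cross-type invariant from that commit is easy to verify from Python:

```python
from decimal import Decimal
from fractions import Fraction

# The scheme from commit dc787d2 (issue #8188) reduces every numeric
# value modulo a fixed prime (sys.hash_info.modulus), so equal values
# hash equally across int, float, Fraction and Decimal.
ints_match = hash(1) == hash(1.0) == hash(Fraction(1, 1)) == hash(Decimal(1))
halves_match = hash(0.5) == hash(Fraction(1, 2))
decimals_match = hash(1.5) == hash(Decimal("1.5"))
```

This invariant is what any public double-hashing API has to preserve, which is why its result may still change between 3.x.0 releases.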
The result of the function can change in Python 3.x.0 releases
Previously, the function signature for _Py_HashDouble() was changed too.
vstinner commented Nov 15, 2023
If you consider that something should be changed, you can open a new issue about
serhiy-storchaka left a comment
This PR contains many cosmetic changes, but also some changes that can affect performance in theory (like adding the "else" branch or adding an additional check for -1). Please run precise benchmarks for this. Also consult the previous authors of this code.
To make PyHash_Double a replacement of _PyHash_Double you need Py_HashPointer. Maybe add it first?
BTW, should it be PyHash_Double or Py_HashDouble?
encukou commented Nov 15, 2023
In terms of the proposed C API guidelines: this violates the one where negative results are reserved for errors. Actually, hashes might be a good argument for reserving only -1 for errors, letting the other negative numbers mean success.
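CPython already applies exactly this "-1 only" convention at the `tp_hash` boundary, which can be observed from Python:

```python
# If a __hash__ implementation returns -1, the interpreter silently
# substitutes -2, so a -1 result from tp_hash can only ever mean that
# an exception was raised.
class ReturnsMinusOne:
    def __hash__(self):
        return -1

substituted = hash(ReturnsMinusOne())
int_case = hash(-1)  # the int -1 gets the same substitution
```

So -1 is already unavailable as a real hash value, while every other negative number is a legitimate success result.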
vstinner commented Nov 15, 2023
24c3f6d to a58dcd1 Compare

* Add again the _PyHASH_NAN constant.
* _Py_HashDouble(NULL, value) now returns _PyHASH_NAN.
* Add tests: Modules/_testcapi/hash.c and Lib/test/test_capi/test_hash.py.
vstinner commented Nov 15, 2023 • edited
I updated the PR to address @serhiy-storchaka's and @encukou's comments. @serhiy-storchaka and @encukou: Please review the updated PR.

The API changed to:

The interesting case is that obj can be

Changes:
* If *obj* is not ``NULL``, return the hash of the *obj* pointer.
* Otherwise, return :data:`sys.hash_info.nan <sys.hash_info>` (``0``).

The function cannot fail: it cannot return ``-1``.
But we do want users to check the result, so that the function can start failing in some cases in the future.
Suggested change:

- The function cannot fail: it cannot return ``-1``.
+ On failure, the function returns ``-1`` and sets an exception.
+ (``-1`` is not a valid hash value; it is only returned on failure.)
I don't see the point of asking developers to make their code slower for a case which cannot happen. It would make C extensions slower for no reason, no?

PyObject_Hash(obj) can call an arbitrary __hash__() method in Python and so can fail. But PyHash_Double() is simple and cannot fail. It's just that it has the same API as PyObject_Hash() and PyTypeObject.tp_hash for convenience.

For me, it's the same as PyType_CheckExact(obj): the function cannot fail. Do you want to suggest users start checking for -1 because the API says it may set an exception and return -1? IMO practicality beats purity here.
I am strongly for allowing deprecation via runtime warnings, and for keeping new API consistent in that respect.
If the speed is an issue (which I doubt, with branch prediction around), let's solve that in a way that still allows the API to report errors.
I am strongly for allowing deprecation via runtime warnings, and for keeping new API consistent in that respect.
I created capi-workgroup/api-evolution#43 to discuss functions which cannot fail: when the caller is not expected to check for errors.
If the speed is an issue (which I doubt, with branch prediction around), let's solve that in a way that still allows the API to report errors.
Would you mind elaborating on how you plan to solve this issue?
My concern is more about usability of the API than performance here.
But yeah, performance matters as well. Such a function can be used in a hash table (when floats are used as keys), and making it as fast as possible matters.
Would you mind elaborating on how you plan to solve this issue?
It's possible for specific compilers: add a static inline wrapper with if (result == -1) __builtin_unreachable(); or __assume(result != -1).
That way the compiler can optimize error checking away, until a later Python version decides to allow failures.
vstinner commented Nov 17, 2023
numpy issue: numpy/numpy#25035
encukou commented Nov 20, 2023
Stepping back, I think I figured out why this API seems awkward to me. The actual use case the proposed function allows is implementing a custom object that needs to hash the same as a Python double, where NaN should be hashable (like Python doubles are). That use case is what Python needs, and I assume it's what NumPy needs, but IMO it's unnecessarily limited -- which makes the function well suited for an internal (or unstable) function, but worse for public API.

With a signature like:

int PyHash_Double(double value, Py_hash_t *result)
// -> 1 for non-NaN (*result is set to the hash)
// -> 0 for NaN (*result is set to 0)

the function would cover the more general use case as well. The docs can note that when implementing hash for a custom object, if you get a 0 result you can:
(I'll leave the fallibility argument to the WG repos, just noting that this API would allow adding -1 as a possible result.)
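The proposed two-result signature can be modeled in Python (`py_hash_double` is a hypothetical sketch, not CPython code; the tuple's second item stands in for `*result`):

```python
import math

# Model of the proposed `int PyHash_Double(double value, Py_hash_t *result)`:
# the return code flags NaN, and the out-parameter carries the hash.
def py_hash_double(value):
    if math.isnan(value):
        return 0, 0          # 0 -> NaN; *result set to 0
    return 1, hash(value)    # 1 -> non-NaN; *result holds the hash

ok, h = py_hash_double(2.5)
nan_ok, nan_h = py_hash_double(math.nan)
```

The caller, not the hashing function, then decides what to do with a NaN: fall back to an identity hash, raise, or use 0.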
mdickinson commented Nov 26, 2023
FWIW, I also find the proposed API (

The signature proposed by @encukou looks reasonable to me. I don't want to get involved in the discussion about error returns, but if we really wanted to, we could have a documented sentinel hash value that's only ever returned for NaNs (e.g., 2**62).
vstinner commented Nov 27, 2023
There is

The problem with
vstinner commented Nov 27, 2023
The first PR version used the API:

The second PR version changed the API to

Well, nobody (including me, to be honest) likes

So I wrote PR #112449 which implements the API proposed by @encukou:
serhiy-storchaka commented Nov 27, 2023
I like the simpler API proposed in this PR more; my only concern is about performance. Can the difference be observed in microbenchmarks, or is it insignificant?

Instead of checking the result of the function, the user code can check the argument before calling the function: `if (Py_IS_NAN(value)) { return _Py_HashPointer(obj); } return PyHash_Double(value);`

As for names, "Hash" is a verb, and "Double" is an object.
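The caller-side pattern above can be sketched in Python (`float_like_hash` is a hypothetical helper, with `object.__hash__` playing the role of `_Py_HashPointer`):

```python
import math

# Keep the hashing function single-argument; the caller handles NaN
# first and falls back to an identity-based hash.
def float_like_hash(obj, value):
    if math.isnan(value):
        return object.__hash__(obj)  # identity fallback for NaN
    return hash(value)

class Box:
    pass

box = Box()
regular = float_like_hash(box, 1.5)
nan_hash = float_like_hash(box, math.nan)
```

This keeps the NaN policy entirely in the caller's hands, at the cost of every caller having to remember the check.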
encukou commented Nov 27, 2023
I don't see this used in tight loops, so I'd go for the prettier API even if it's a few instructions slower. (Note that NumPy uses it for scalars -- degenerate arrays of size 1 -- to ensure compatibility with Python doubles.)

+1 on the naming note,
mdickinson commented Nov 27, 2023
Yes, I'm rather familiar with the history here. :-) I was simply suggesting that if we picked and documented a value that's not used for the hash of any non-NaN, then the hash value itself could be a way of detecting NaNs after the fact. In any case, I was just mentioning this as a possibility. I'd prefer a more direct method of NaN detection, like what you have currently implemented (or not having the API do NaN detection at all, but leave that to the user to do separately if necessary).
vstinner commented Nov 27, 2023 • edited
I ran microbenchmarks on 3 different APIs. Measured performance is between 13.6 ns and 14.7 ns. The maximum difference is 1.1 ns: 1.08x slower. It seems like the current _Py_HashDouble() API is the fastest.

In the 3 APIs, the C double input value is passed through the

I expected

I wrote 3 benchmarks on:
Results using CPU isolation, Python built with
I added benchmark code in

I added an artificial test on the hash value to use it in the benchmark, so the compiler doesn't remove the whole function call, and the code is a little bit more realistic.

(A) assembly code:

(B) assembly code:

(C) assembly code:

Change used to benchmark (A) and (B). Measuring (C) requires minor changes.

diff --git a/Modules/_testinternalcapi.c b/Modules/_testinternalcapi.c
index 4607a3faf17..9b671acf916 100644
--- a/Modules/_testinternalcapi.c
+++ b/Modules/_testinternalcapi.c
@@ -1625,6 +1625,51 @@ get_type_module_name(PyObject *self, PyObject *type)
 }
 
+static PyObject *
+test_bench_private_hash_double(PyObject *Py_UNUSED(module), PyObject *args)
+{
+    Py_ssize_t loops;
+    if (!PyArg_ParseTuple(args, "n", &loops)) {
+        return NULL;
+    }
+    PyObject *obj = Py_None;
+    double d = 1.0;
+
+    _PyTime_t t1 = _PyTime_GetPerfCounter();
+    for (Py_ssize_t i=0; i < loops; i++) {
+        Py_hash_t hash = _Py_HashDouble(obj, d);
+        if (hash == -1) {
+            return NULL;
+        }
+    }
+    _PyTime_t t2 = _PyTime_GetPerfCounter();
+
+    return PyFloat_FromDouble(_PyTime_AsSecondsDouble(t2 - t1));
+}
+
+
+static PyObject *
+test_bench_public_hash_double(PyObject *Py_UNUSED(module), PyObject *args)
+{
+    Py_ssize_t loops;
+    if (!PyArg_ParseTuple(args, "n", &loops)) {
+        return NULL;
+    }
+    double d = 1.0;
+
+    _PyTime_t t1 = _PyTime_GetPerfCounter();
+    for (Py_ssize_t i=0; i < loops; i++) {
+        Py_hash_t hash;
+        if (PyHash_Double(d, &hash) == 0) {
+            return NULL;
+        }
+    }
+    _PyTime_t t2 = _PyTime_GetPerfCounter();
+
+    return PyFloat_FromDouble(_PyTime_AsSecondsDouble(t2 - t1));
+}
+
+
 static PyMethodDef module_functions[] = {
     {"get_configs", get_configs, METH_NOARGS},
     {"get_recursion_depth", get_recursion_depth, METH_NOARGS},
@@ -1688,6 +1733,8 @@ static PyMethodDef module_functions[] = {
     {"restore_crossinterp_data", restore_crossinterp_data, METH_VARARGS},
     _TESTINTERNALCAPI_TEST_LONG_NUMBITS_METHODDEF
     {"get_type_module_name", get_type_module_name, METH_O},
+    {"bench_private_hash_double", test_bench_private_hash_double, METH_VARARGS},
+    {"bench_public_hash_double", test_bench_public_hash_double, METH_VARARGS},
     {NULL, NULL}  /* sentinel */
 };

Bench script for (A), private API:

import pyperf
import _testinternalcapi

runner = pyperf.Runner()
runner.bench_time_func('bench', _testinternalcapi.bench_private_hash_double)

Bench script for (B) and (C), public API:

import pyperf
import _testinternalcapi

runner = pyperf.Runner()
runner.bench_time_func('bench', _testinternalcapi.bench_public_hash_double)
vstinner commented Nov 27, 2023
Other benchmark results using PGO:
These numbers are surprising.
vstinner commented Nov 27, 2023 • edited
I wrote PR #112476 to run this benchmark differently. Results look better (less surprising, more reliable than my previous benchmark). Results with CPU isolation,
API (A) and (C) have the same performance. API (B) is 0.9 ns slower than (A) and (C): it is 1.07x slower than (A) and (C). I ran the benchmark with
vstinner commented Nov 28, 2023
If we only care about performance, an alternative is API (D):
Note: Passing a non-NULL is_nan or passing a NULL is_nan has no significant impact on API (D) performance. It may be interesting if you know that the number cannot be NaN.
vstinner commented Nov 28, 2023
It would be interesting to design the C API namespace in a similar way to Python packages and Python package sub-modules: But the Python C API is far from respecting such a design :-) The "Py_" namespace is a giant bag full of "anything".
vstinner commented Nov 28, 2023
By the way, see also PR #112096 which adds the PyHash_Pointer() function.
serhiy-storchaka commented Nov 28, 2023
How much difference does it make if we remove the check for NaN from the function but add it before calling the function, as in #112095 (comment)?
encukou commented Nov 28, 2023
As far as I know, microbenchmarks at this level are susceptible to “random” variations due to code layout. An individual function should be benchmarked in a variety of calling situations to get a meaningful result. (But to reiterate: I don't think this is the place for micro-optimizations.)
vstinner commented Nov 28, 2023
It would be bad to return the same hash value for +inf, -inf and NaN values. Current code:

if (!Py_IS_FINITE(v)) {
    if (Py_IS_INFINITY(v))
        return v > 0 ? _PyHASH_INF : -_PyHASH_INF;
    else
        return _Py_HashPointer(inst);  // v is NaN
}
vstinner commented Nov 28, 2023
We can take it into account in the API design. Now we know that the API

I don't think that 0.9 ns is a significant difference. If a workload is impacted by 0.9 ns per function call, maybe they should copy the PyHash_Double() code and design a very specialized flavor for their workload (ex: remove the code for infinity and NaN, and inline everything).

In terms of performance, I think that any proposed API is fine.
vstinner commented Nov 28, 2023
@serhiy-storchaka @encukou: Do you prefer

Now that I have read the previous discussions, I think I prefer
encukou commented Nov 30, 2023
Yeah,
vstinner commented Nov 30, 2023
First, I proposed

I'm not fully comfortable with this API either. I close this PR. Let's continue the discussion in PR #112449 which implements the API proposed by @encukou.
Cleanup PyHash_Double() implementation based on _Py_HashDouble():
Add tests: Modules/_testcapi/hash.c and Lib/test/test_capi/test_hash.py.

Make _Py_HashDouble public again as "unstable" API #111545

📚 Documentation preview 📚: https://cpython-previews--112095.org.readthedocs.build/