Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory access out of bounds in numpy + chrome 89 #1474

Merged
merged 23 commits into from
Apr 17, 2021

Conversation

hoodmane
Copy link
Member

@hoodmane hoodmane commented Apr 15, 2021

This is an attempt to fix #1473.

@hoodmane hoodmane changed the title Another numpy patch Fix memory access out of bounds in numpy + chrome 89 Apr 15, 2021
@hoodmane
Copy link
Member Author

hoodmane commented Apr 15, 2021

@rth There's a new intermittent failure in build-core =(

error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the
installation directory:

    [Errno 17] File exists: '/root/repo/packages/.artifacts/lib/python'

The installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:

    /root/repo/packages/.artifacts/lib/python/

@rth
Copy link
Member

rth commented Apr 15, 2021

Thanks for trying to fix this!

error: can't create or remove files in install directory

Never saw this one before so far.

@hoodmane
Copy link
Member Author

hoodmane commented Apr 15, 2021

I think I've identified a general (but tedious) procedure to fix these:

  1. Locate failing instruction, it's either a load or a store
  2. Locate line of source failing instruction occurred inside of
  3. Extract offending load or store into a separate noinline function
  4. Recompile and repeat on next failing load or store

@hoodmane
Copy link
Member Author

Actually I think I can add:

__attribute__((noinline))
NPY_NO_EXPORT
void
assign_to_ptr(void *ptr, void* val){
    *ptr = val;
}

__attribute__((noinline))
NPY_NO_EXPORT
void *
read_from_ptr(void *ptr){
    return *ptr;
}

and replace most memory accesses in PyArray_Broadcast with these two functions.

@hoodmane
Copy link
Member Author

@rth I suspect fixing 0.17.0 + chrome 89 + numpy may be a lost cause. Question is, why are these random memory access failures only happening in numpy? Is it loaded in an unusual way? Or maybe these load/store issues aren't just in numpy but also in any C extension in a side module? How much test code do we have that uses C extension modules?

I think we have good reason to believe that this chrome 89 bug does not affect the core Pyodide since that code has a lot of coverage. I think it's possible that not just numpy but all extension modules are broken. If it really is only numpy and not all extension modules that are affected, then we would need to get to the bottom of why numpy specifically.

One thing that would be useful would be to go through the benchmarks one at a time and see which ones succeed / fail. That probably would provide valuable clues.

@rth
Copy link
Member

rth commented Apr 15, 2021

Question is, why are these random memory access failures only happening in numpy? [...] I think it's possible that not just numpy but all extension modules are broken.

Hmm, well we had some of those before as well e.g. in scipy #729 (comment) even before the emscripten migration (for all browsers). So it's not a completely new occurrence, just new in numpy.

I suspect fixing 0.17.0 + chrome 89 + numpy may be a lost cause.

Yeah, if it's widespread patching is likely not going to be enough. How about emscripten or LLVM options for instance to prevent inlining https://github.com/emscripten-core/emscripten/blob/e91b2274b84cafcad05382f77e4c4b6e1ddc354d/src/settings.js#L292. Those can be added for numpy specifically (in meta.yaml) or for all packages.

How much test code do we have that uses C extension modules?

We can run package test suites inside pyodide in REPL. For instance,

>>> import pytest
>>> import regex
>>> pytest.main(['--pyargs', 'regex', '--continue-on-collection-errors'])

to run tests for the regex package (which has C extensions) and produces in Chrome

�[31m�[1m===================== 1 failed, 100 passed in 2.32 seconds =====================�[0m

So it doesn't look like all C extensions are broken.

For numpy better to run those tests submodule by submodule (e.g. on numpy.fft), as there is no output until all tests are run, and it takes a while. Previously for numpy there weren't that many test failures #69 (comment) We should probably start running those tests more systematically (if not for each commit). It's also possible to use -k flags to filter specific tests as with normal pytest run, so I think that should help us identify parts of the numpy test suite that fail with this error at least.

Edit: for instance in Firefox, numpy.core tests run tests as

�[31m�[1m== 27 failed, 2138 passed, 16 skipped, 7 xfailed, 3 error in 1013.10 seconds ===�[0m 1 >>> 

in Chrome, the browser tab crashes for dev version so I probably need to rebuild with debug symbols as you did.

@hoodmane
Copy link
Member Author

it doesn't look like all C extensions are broken.

Well that's good =)

We should probably start running those tests more systematically (if not for each commit).

Ideally it'd be nice to run occasional tests on other browsers too. It'd be great to have a larger test suite that would run say every two weeks but only if there's been a new commit. Sounds annoying to set up though.

@hoodmane
Copy link
Member Author

numpy.core tests

I think the core test suite should likely catch these bugs we've been seeing.

@hoodmane
Copy link
Member Author

One thing that is more worrisome about this than say #729 is that the errors in #729 are at free and malloc. These numpy problems occur at normal wasm load and store instructions, so in principle they could occur at any memory operation. Particularly in PyArray_Broadcast the failures seem to be willing to move between different memory operations without any clear pattern.

If the troubles are only in PyArray_Broadcast then I'm pretty confident my current methods will be sufficient with enough patience.

@hoodmane
Copy link
Member Author

hoodmane commented Apr 16, 2021

How about emscripten or LLVM options for instance to prevent inlining?

Unfortunately these errors are not problems with inlining. The crashes are happening with stuff like:

for(i = 0; i < len; i++){
    max = MAX(array[i], max);
}
for(i = 0; i < len; i++){
    x = array[i]; // crash here
    // do something with x
}

@hoodmane hoodmane added this to the 0.17.0 milestone Apr 16, 2021
@hoodmane
Copy link
Member Author

hoodmane commented Apr 16, 2021

@rth The docs say:

build/skip_host

Skip building C extensions for the host environment. Default: True.

Setting this to False will result in ~2x slower builds for packages that include C extensions. It should only be needed when a package is a build time dependency for other packages. For instance, numpy is imported during installation of matplotlib, importing numpy also imports included C extensions, therefore it is built both for host and target.

Is there a reason we couldn't use pip to install numpy in the Dockerfile and then just use that for the host numpy rather than building it? I think this could speed up build packages by a lot and also fix all those import numpy crashes during the scipy and matplotlib builds.

@rth
Copy link
Member

rth commented Apr 16, 2021

Is there a reason we couldn't use pip to install numpy in the Dockerfile and then just use that for the host numpy rather than building it?

It might work, but I think we still need to patch the configuration

cp config/* $PKGDIR/../.artifacts/lib/python/numpy-1.17.5-py3.8-linux-x86_64.egg/numpy/core/include/numpy

In any case, #1476 would be a solution to that build issue I think.

@hoodmane
Copy link
Member Author

@rth Okay this fixes it! (Benchmarks passed in test suite.)

@hoodmane
Copy link
Member Author

hoodmane commented Apr 16, 2021

Looks like there is one more segfault in the chrome packages test suite, this one in PyObject_GenericGetAttr, in pandas/test_pandas.py::test_load_largish_file:

E           conftest.JavascriptException: RuntimeError: memory access out of bounds
E               at __static_1 (<anonymous>:wasm-function[174]:0x17748)
E               at byn$fpcast-emu$__static_1 (<anonymous>:wasm-function[506]:0x87748)
E               at __static_2 (<anonymous>:wasm-function[175]:0x17eca)
E               at __static_199 (<anonymous>:wasm-function[339]:0x538d5)
E               at byn$fpcast-emu$__static_199 (<anonymous>:wasm-function[704]:0x881f7)
E               at __static_254 (http://127.0.0.1:35323/pyodide.asm.wasm:wasm-function[900]:0x1d3c92)
E               at byn$fpcast-emu$__static_254 (http://127.0.0.1:35323/pyodide.asm.wasm:wasm-function[21695]:0x7bc48e)
E               at _PyObject_GenericGetAttrWithDict (http://127.0.0.1:35323/pyodide.asm.wasm:wasm-function[1712]:0x2183aa)
E               at PyObject_GenericGetAttr (http://127.0.0.1:35323/pyodide.asm.wasm:wasm-function[1711]:0x21815f)

@rth
Copy link
Member

rth commented Apr 16, 2021

Okay this fixes it! (Benchmarks passed in test suite.)

Wonderful , great work @hoodmane !

Looks like there is one more segfault in the chrome packages test suite

I'll try to run the tests numpy test suite later and will report how it goes. It might be worth adding a comment on how you compute those offsets.

@hoodmane
Copy link
Member Author

It might be worth adding a comment on how you compute those offsets.

Working on cleaning things up now.

@hoodmane
Copy link
Member Author

hoodmane commented Apr 16, 2021

Okay I think this is ready to merge when the CI finishes.

@hoodmane hoodmane requested a review from rth April 16, 2021 19:40
@hoodmane
Copy link
Member Author

JsException: ReferenceError: it_dims_k is not defined

@rth
Copy link
Member

rth commented Apr 17, 2021

The good news is as far as I can tell this issue (or at least the original one) is not happening with Chrome 91 (and possibly 90) and with browser auto-update, most users actually seem to use the latest Chrome version. So 6 month after 91 release we could very well set it as minimum requirement and remove these patches.

Spend some more time digging through the Chrome issue tracker, and opened a new issue https://bugs.chromium.org/p/chromium/issues/detail?id=1200031 to make sure it doesn't go unreported (if it wasn't already fixed)

@rth
Copy link
Member

rth commented Apr 17, 2021

The code looks fine to me, thanks for the documentation effort.

Let me just run a few more numpy tests before merging.

@hoodmane
Copy link
Member Author

opened a new issue https://bugs.chromium.org/p/chromium/issues/detail?id=1200031 to make sure it doesn't go unreported

Thanks, that clearly needed to be done. I added some info there.

@hoodmane
Copy link
Member Author

@rth We good to merge this now?

@rth
Copy link
Member

rth commented Apr 17, 2021

Run a few numpy broadcasting tests, to check the soundness of the re-implementation,

>>> import numpy
>>> import pytest
>>> pytest.main(['--pyargs', 'numpy', '--continue-on-collection-errors', '-k', 'broadcast'])
  • on master (in Firefox 87),
    128 passed, 8727 deselected, 3 error in 267.95 seconds
  • in this PR (in Firefox 87),
    128 passed, 8727 deselected, 3 error in 264.91 seconds
  • in this PR (in Chrome 89)
    128 passed, 8727 deselected, 3 error in 215.32 seconds

The 3 failures are due to ctypes. To measurable impact on the overall performance either.

Great work @hoodmane !

@rth rth merged commit 1485732 into pyodide:master Apr 17, 2021
@hoodmane hoodmane deleted the numpy-fix-2 branch April 17, 2021 21:59
@rth rth mentioned this pull request Apr 19, 2021
3 tasks
@daoxian
Copy link
Contributor

daoxian commented Apr 20, 2021

Still crashed in Chrome 90.0.4430.72.

pyodide.js:432 Pyodide has suffered a fatal error, refresh the page. Please report this to the Pyodide maintainers.
Module.fatal_error @ pyodide.js:432
pyodide.js:433 The cause of the fatal error was:
Module.fatal_error @ pyodide.js:433
pyodide.js:434 RuntimeError: memory access out of bounds
    at PyArray_Broadcast (<anonymous>:wasm-function[1536]:0xcdab3)
    at <anonymous>:wasm-function[1538]:0xce06f
    at __static_809 (<anonymous>:wasm-function[1542]:0xce5f0)
    at byn$fpcast-emu$__static_809 (<anonymous>:wasm-function[6879]:0x219772)
    at __static_861 (https://【my-domain-name】/wasm.master/pyodide.asm.wasm:wasm-function[1972]:0x22cb48)
    at byn$fpcast-emu$__static_861 (https://【my-domain-name】/wasm.master/pyodide.asm.wasm:wasm-function[22302]:0x7be3ba)
    at PyObject_Call (https://【my-domain-name】/wasm.master/pyodide.asm.wasm:wasm-function[763]:0x1caf37)
    at _PyEval_EvalFrameDefault (https://【my-domain-name】/wasm.master/pyodide.asm.wasm:wasm-function[2758]:0x2a9d54)
    at byn$fpcast-emu$_PyEval_EvalFrameDefault (https://【my-domain-name】/wasm.master/pyodide.asm.wasm:wasm-function[15507]:0x7a5f64)
    at _PyEval_EvalCodeWithName (https://【my-domain-name】/wasm.master/pyodide.asm.wasm:wasm-function[2754]:0x2a422c)
Module.fatal_error @ pyodide.js:434
pyodide.asm.js:9 Stack (most recent call first):
pyodide.asm.js:9   File "/lib/python3.8/site-packages/numpy/lib/stride_tricks.py", line 191 in _broadcast_shape
pyodide.asm.js:9   File "/lib/python3.8/site-packages/numpy/lib/stride_tricks.py", line 264 in broadcast_arrays
pyodide.asm.js:9   File "<__array_function__ internals>", line 5 in broadcast_arrays
pyodide.asm.js:9   File "/lib/python3.8/site-packages/numpy/lib/function_base.py", line 4213 in meshgrid
pyodide.asm.js:9   File "<__array_function__ internals>", line 5 in meshgrid
pyodide.asm.js:9   File "/lib/python3.8/site-packages/pysot/tracker/siamrpn_tracker.py", line 85 in generate_anchor
pyodide.asm.js:9   File "/lib/python3.8/site-packages/pysot/tracker/siamrpn_tracker.py", line 59 in __init__
pyodide.asm.js:9   File "<exec>", line 9 in <module>
pyodide.asm.js:9   File "/lib/python3.8/site-packages/pyodide/_base.py", line 270 in run_async
pyodide.asm.js:9   File "/lib/python3.8/site-packages/pyodide/_base.py", line 419 in eval_code_async
pyodide.asm.js:9   File "/lib/python3.8/asyncio/tasks.py", line 280 in __step
pyodide.asm.js:9   File "/lib/python3.8/asyncio/events.py", line 81 in _run
pyodide.js:456 Uncaught RuntimeError: memory access out of bounds
    at PyArray_Broadcast (<anonymous>:wasm-function[1536]:0xcdab3)
    at <anonymous>:wasm-function[1538]:0xce06f
    at __static_809 (<anonymous>:wasm-function[1542]:0xce5f0)
    at byn$fpcast-emu$__static_809 (<anonymous>:wasm-function[6879]:0x219772)
    at __static_861 (https://【my-domain-name】/wasm.master/pyodide.asm.wasm:wasm-function[1972]:0x22cb48)
    at byn$fpcast-emu$__static_861 (https://【my-domain-name】/wasm.master/pyodide.asm.wasm:wasm-function[22302]:0x7be3ba)
    at PyObject_Call (https://【my-domain-name】/wasm.master/pyodide.asm.wasm:wasm-function[763]:0x1caf37)
    at _PyEval_EvalFrameDefault (https://【my-domain-name】/wasm.master/pyodide.asm.wasm:wasm-function[2758]:0x2a9d54)
    at byn$fpcast-emu$_PyEval_EvalFrameDefault (https://【my-domain-name】/wasm.master/pyodide.asm.wasm:wasm-function[15507]:0x7a5f64)
    at _PyEval_EvalCodeWithName (https://【my-domain-name】/wasm.master/pyodide.asm.wasm:wasm-function[2754]:0x2a422c)

@hoodmane
Copy link
Member Author

hoodmane commented Apr 20, 2021

Using import numpy as np; np.meshgrid(range(10), range(10)) on https://pyodide-cdn2.iodide.io/dev/full/console.html? I can't reproduce.

@daoxian
Copy link
Contributor

daoxian commented Apr 20, 2021

Using import numpy as np; np.meshgrid(range(10), range(10)) on https://pyodide-cdn2.iodide.io/dev/full/console.html? I can't reproduce.

OK, you're right. I was using the old version code.

Now, after I've pulled the newest code from git, no crash appears any more.

@rth
Copy link
Member

rth commented Apr 20, 2021

Great! Thanks for checking @daoxian !

hamlet4401 pushed a commit to tytgatlieven/pyodide that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Memory access out of bounds numpy + chrome 89
3 participants