
Schema validation crashes when running in an environment without internet access #1916

Closed
lhenkelm opened this issue Jul 6, 2022 · 8 comments · Fixed by #1917
Labels
bug Something isn't working

Comments

@lhenkelm
Contributor

lhenkelm commented Jul 6, 2022

Summary

In master and the 0.7.0 release candidate, pyhf operations involving model validation crash in offline environments with a RefResolutionError. This is a common situation, e.g. on worker nodes of HTC clusters.
The bug was introduced after 0.6.3, I think in #1753, where the pre-loading was dropped.

OS / Environment

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="http://cern.ch/linux/"
BUG_REPORT_URL="http://cern.ch/linux/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Steps to Reproduce

I don't know a good way to prepare an offline environment to demonstrate this directly.
However, the test below exposes the RefResolver's attempt to resolve the schema id through the https URL: it fails against the release candidate/master, but passes on 0.6.3.

from functools import partial

import pytest
import jsonschema
import pyhf

def make_asserting_handler(origin):
    def asserting_handler(*args, **kwargs):
        raise AssertionError(
            f'called URL request handler from {origin} with args={args!r}, kwargs={kwargs!r} '
            'when no call should have been needed'
        )

    return asserting_handler


@pytest.fixture
def no_http_jsonschema_ref_resolving(monkeypatch):
    # Patch jsonschema.RefResolver so any attempt to resolve a $ref over
    # http(s) raises immediately instead of hitting the network.
    asserting_handler = make_asserting_handler('handlers')
    handlers = {
        'https': asserting_handler,
        'http': asserting_handler,
    }
    WrappedResolver = partial(jsonschema.RefResolver, handlers=handlers)
    monkeypatch.setattr('jsonschema.RefResolver', WrappedResolver, raising=True)


def test_preloaded_cache(
    no_http_jsonschema_ref_resolving,
):
    spec = {
        'channels': [
            {
                'name': 'singlechannel',
                'samples': [
                    {
                        'name': 'signal',
                        'data': [10],
                        'modifiers': [
                            {'name': 'mu', 'type': 'normfactor', 'data': None}
                        ],
                    },
                    {
                        'name': 'background',
                        'data': [20],
                        'modifiers': [
                            {
                                'name': 'uncorr_bkguncrt',
                                'type': 'shapesys',
                                'data': [30],
                            }
                        ],
                    },
                ],
            }
        ]
    }
    try:
        # pyhf >= 0.7 API
        pyhf.schema.validate(spec, 'model.json')
    except AttributeError:
        # fall back to the pyhf 0.6.x API
        pyhf.utils.validate(spec, 'model.json')


Expected Results

I expect schema validation to succeed without crashing even when there is no network access that allows resolving the https schema-ids.

Actual Results

jsonschema.exceptions.RefResolutionError: HTTPSConnectionPool(host='scikit-hep.org', port=443): Max retries exceeded with url: /pyhf/schemas/1.0.0/defs.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2b2bb8457c40>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

pyhf Version

pyhf, version 0.7.0rc2

@masonproffitt
Contributor

Related to this is a large increase in the execution time of pyhf.Model.__init__() due to the network request. For example, using the "Hello World" case from the README in v0.6.3, I get:

$ python -m timeit -s 'import pyhf' 'pyhf.simplemodels.uncorrelated_background(signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0])'
50 loops, best of 5: 4.36 msec per loop

And for v0.7.0rc1:

$ python -m timeit -s 'import pyhf' 'pyhf.simplemodels.uncorrelated_background(signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0])'
1 loop, best of 5: 218 msec per loop

For any code that builds many Models often (as some of mine does), this 50x slowdown can suddenly dominate the execution time.

@kratsg
Contributor

kratsg commented Jul 11, 2022

For any code that builds many Models often (as some of mine does), this 50x slowdown can suddenly dominate the execution time.

The network request should at least be cached for the next time the model is built. Your timeit is starting up a new session each time (which may or may not be what the user wants).

@kratsg
Contributor

kratsg commented Jul 11, 2022

I expect schema validation to succeed without crashing even when there is no network access that allows resolving the https schema-ids.

Yeah, this is a regression. I thought I had this fixed because I remember mentioning it somewhere.

@masonproffitt
Contributor

For any code that builds many Models often (as some of mine does), this 50x slowdown can suddenly dominate the execution time.

The network request should at least be cache'd for the next time the model is built. Your timeit is starting up a new session each time (which may or may not be what the user wants).

It's not starting a new session:

python -m timeit -s 'import pyhf' -v 'pyhf.simplemodels.uncorrelated_background(signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0])'
1 loop -> 0.491 secs

raw times: 209 msec, 232 msec, 242 msec, 229 msec, 218 msec

1 loop, best of 5: 209 msec per loop

You can see the first Timer.autorange() call takes 0.491 s, which is with a cold cache. The rest of the calls take ~0.2 s, which is basically just the HTTPS round trip time. Even with the file data cached, it's still 50x slower than before.
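
For comparison, a small in-process variant of the same measurement (illustrative only; absolute numbers depend on the machine and network) that separates the first cold-cache build from repeated builds in the same interpreter:

import timeit

import pyhf

def build():
    return pyhf.simplemodels.uncorrelated_background(
        signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
    )

print("first build:", timeit.timeit(build, number=1), "s")          # cold cache
print("later builds:", timeit.timeit(build, number=10) / 10, "s")   # same process, warm cache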

@kratsg
Contributor

kratsg commented Jul 11, 2022

You can see the first Timer.autorange() call takes 0.491 s, which is with a cold cache. The rest of the calls take ~0.2 s, which is basically just the HTTPS round trip time. Even with the file data cached, it's still 50x slower than before.

That's weird, as the resolver shouldn't make https calls once cached... so there must be something else going on here?

@masonproffitt
Contributor

masonproffitt commented Jul 11, 2022

Every time pyhf.schema.validate() is called, a new jsonschema.RefResolver is created. The default remote cache is a per-instance attribute of RefResolver, so the cache will never be fully utilized by pyhf in this configuration.

Edit: Ah, but this part hasn't changed since the last release. You're talking about the other cache (in RefResolver.store) that isn't working as intended...

Edit 2: Okay, the answer there is similar and still in that same code (https://github.com/python-jsonschema/jsonschema/blob/v4.7.1/jsonschema/validators.py#L696). RefResolver.__init__() only copies the entries from the store passed to it, so it cannot propagate any data back to pyhf.schema.variables.SCHEMA_CACHE. As a result, each newly created RefResolver lacks the data that was cached by any previous RefResolver.
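
A minimal sketch of that copy behavior (the URI and document below are placeholders, not real pyhf schemas):

import jsonschema

# Pre-seeded cache, analogous to passing SCHEMA_CACHE as the store.
shared_cache = {"https://example.invalid/defs.json": {"definitions": {}}}

resolver = jsonschema.RefResolver(base_uri="", referrer={}, store=shared_cache)

# The pre-seeded entry is visible on the resolver...
assert "https://example.invalid/defs.json" in resolver.store

# ...but resolver.store is a copy: anything cached on this instance later
# never flows back into shared_cache, so the next resolver starts cold again.
resolver.store["https://example.invalid/other.json"] = {}
assert "https://example.invalid/other.json" not in shared_cache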

@kratsg
Contributor

kratsg commented Jul 11, 2022

Interesting; then store seems pretty useless on its own, since based on the documentation I would have assumed the resolver would check it first:

        A mapping from URIs to documents to cache

So we'll have to refactor this based on the unexpected behavior in jsonschema.

EDIT: a naive solution would be to add the full URI for defs.json to the store, in addition to the local path, and have both keys point to the same object in memory. This isn't the greatest, but I see where the copy happens.
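
A rough sketch of that idea (illustrative only; the load_schema call and key names are assumptions based on this discussion, not the actual fix in #1917): register the same schema object under both keys before constructing the resolver.

import jsonschema

import pyhf

# Assumed here: pyhf.schema.load_schema('defs.json') returns the parsed schema.
defs = pyhf.schema.load_schema('defs.json')

store = {
    'defs.json': defs,                                             # local path
    'https://scikit-hep.org/pyhf/schemas/1.0.0/defs.json': defs,   # full URI ($id)
}

resolver = jsonschema.RefResolver(
    base_uri='https://scikit-hep.org/pyhf/schemas/1.0.0/',
    referrer=defs,
    store=store,
)
# Any $ref to either key now resolves from memory, so no HTTPS request is made.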

@lhenkelm
Contributor Author

EDIT: a naive solution would be to add the full URI for defs.json in the store, in addition to the local path, and have them both point to the same object in memory. This isn't the greatest, but I see the copy here.

Sorry if I am missing your point, but is that not the behavior of v0.6.3? load_schema caches the schema under its "$id", which for 'defs.json' is "https://scikit-hep.org/pyhf/schemas/1.0.0/defs.json".
