
Cannot retrieve some cluster files #20

Open
L40S38 opened this issue Sep 24, 2022 · 2 comments

Comments

@L40S38

L40S38 commented Sep 24, 2022

Hi.

I ran the evaluation commands for the Vertex and ProSPECCTS datasets, but both failed with nearly the same error, shown below.

(I exported `STRUCTURE_DATA_DIR=$DEEPLYTOUGH/datasets_structure`; the path to the repository is omitted.)

```
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5324k  100 5324k    0     0  1471k      0  0:00:03  0:00:03 --:--:-- 1472k
INFO:datasets.vertex:Preprocessing: downloading data and extracting pockets, this will take time.
INFO:root:cluster file path: DeeplyTough/datasets_structure/bc-30.out
WARNING:root:Cluster definition not found, will download a fresh one.
WARNING:root:However, this will very likely lead to silent incompatibilities with any old 'pdbcode_mappings.pickle' files! Please better remove those manually.
Traceback (most recent call last):
  File "DeeplyTough/deeplytough/scripts/vertex_benchmark.py", line 68, in <module>
    main()
  File "DeeplyTough/deeplytough/scripts/vertex_benchmark.py", line 32, in main
    database.preprocess_once()
  File "DeeplyTough/deeplytough/datasets/vertex.py", line 49, in preprocess_once
    clusterer = RcsbPdbClusters(identity=30)
  File "DeeplyTough/deeplytough/misc/utils.py", line 248, in __init__
    self._fetch_cluster_file()
  File "DeeplyTough/deeplytough/misc/utils.py", line 262, in _fetch_cluster_file
    self._download_cluster_sets(cluster_file_path)
  File "DeeplyTough/deeplytough/misc/utils.py", line 253, in _download_cluster_sets
    request.urlretrieve(f'https://cdn.rcsb.org/resources/sequence/clusters/bc-{self.identity}.out', cluster_file_path)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
```

Evaluation on the TOUGH-M1 dataset succeeded, so I suspect that some URL used for the Vertex and ProSPECCTS data has expired.
Would you mind checking?

@JoshuaMeyers
Collaborator

Hey @L40S38, thanks for opening a ticket. It seems this is due to the RCSB PDB cluster file moving. See https://www.rcsb.org/news/feature/6205750d8f40f9265109d39f (in fact it's discontinued and changed, so this may even have scientific implications for DeeplyTough).

I will have a look into it. If you don't need to use the cluster file (e.g. if you are happy with random splitting, or you just want to run the existing models) I believe you can just specify a different splitting method.

@L40S38
Author

L40S38 commented Dec 24, 2022

Hi, long time no see.

I solved this problem, so here is how:

- The URL used to retrieve the cluster file (in `deeplytough/misc/utils.py`) should be changed to:

  https://cdn.rcsb.org/resources/sequence/clusters/clusters-by-entity-{self.identity}.txt

- The notation for sequences in the cluster file also changed, from `{protein_id}_{chain_id}` to `{protein_id}_{entity_id}`, so I could no longer find the cluster id of most proteins.
  You therefore need to obtain the entity id for each chain in some way and look up the cluster it belongs to, or split differently (e.g. `uniprot_folds`).
  In my case, when building `pdbcode_mappings.pickle` during preprocessing of the TOUGH-M1 dataset, I retrieve the entity id in `pdb_chain_to_uniprot` in `deeplytough/datasets/toughm1.py`:

```python
import logging
import requests

logger = logging.getLogger(__name__)

def pdb_chain_to_uniprot(pdb_code, query_chain_id):
    """
    Get the UniProt accession of a PDB chain using the PDBe API,
    and return the chain's entity id alongside (needed to query the
    new clusters-by-entity file).
    """
    result = 'None'
    entity_id = 'None'
    r = requests.get(f'http://www.ebi.ac.uk/pdbe/api/mappings/uniprot/{pdb_code}')
    fam = r.json()[pdb_code]['UniProt']

    for fam_id in fam.keys():
        for chain in fam[fam_id]['mappings']:
            if chain['chain_id'] == query_chain_id:
                if result != 'None' and fam_id != result:
                    logger.warning(f'DUPLICATE {fam_id} {result}')
                result = fam_id
                # The entity id lets us build '{protein_id}_{entity_id}' keys
                entity_id = chain['entity_id']
    if result == 'None':
        logger.warning(f'No uniprot accession found for {pdb_code}: {query_chain_id}')
    return entity_id
```
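To illustrate the second bullet, here is a minimal sketch (not DeeplyTough's actual code, and `load_entity_clusters` is a hypothetical helper) of building a lookup table from the new cluster file, assuming each line lists the whitespace-separated `{protein_id}_{entity_id}` members of one cluster:

```python
def load_entity_clusters(cluster_file_text):
    """Map '{protein_id}_{entity_id}' -> cluster index (line number in the file)."""
    entity_to_cluster = {}
    for cluster_idx, line in enumerate(cluster_file_text.splitlines()):
        for member in line.split():
            # Keep the first cluster a member appears in
            entity_to_cluster.setdefault(member, cluster_idx)
    return entity_to_cluster

# Usage sketch: look up the cluster with the entity id returned above, e.g.
#   clusters = load_entity_clusters(open('clusters-by-entity-30.txt').read())
#   cluster_id = clusters.get(f'{pdb_code.upper()}_{entity_id}')
```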

I hope this helps with your runs.
