Nucleic Acids Res. 2016 Jan 4;44(D1):D1220-8.
doi: 10.1093/nar/gkv1253. Epub 2015 Nov 17.

SureChEMBL: a large-scale, chemically annotated patent document database

George Papadatos et al. Nucleic Acids Res. 2016.

Abstract

SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature daily by an automated text- and image-mining pipeline. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented by sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus. Given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/.

Figures

Figure 1.
Overview of the SureChEMBL data pipeline from the raw patent feed to the standardized compounds in the database.
Figure 2.
(A) Field keyword-based search against full text and patent bibliographic metadata. (B) The equivalent search using the Lucene query fields syntax.
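As a rough illustration of the fielded query form mentioned in panel (B), the sketch below assembles a Lucene-style `field:"value"` query string in Python. The field names `title` and `assignee` and the helper `lucene_field_query` are hypothetical and chosen only for illustration; the actual SureChEMBL field names are not given here and may differ.

```python
def lucene_field_query(fields, operator="AND"):
    """Join field/value pairs into a Lucene-style fielded query string.

    `fields` maps a field name to the phrase to search for in that field.
    Field names are supplied by the caller; none are assumed to exist in
    SureChEMBL itself.
    """
    clauses = []
    for field, value in fields.items():
        # Escape backslashes and double quotes so the phrase stays quoted.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"')
    return f" {operator} ".join(clauses)


# Hypothetical field names, for illustration only.
EXAMPLE_QUERY = lucene_field_query(
    {"title": "kinase inhibitor", "assignee": "Example Corp"}
)
```

Here `EXAMPLE_QUERY` holds the string `title:"kinase inhibitor" AND assignee:"Example Corp"`, which is the general shape of a fielded Lucene query with a Boolean operator between clauses.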
Figure 3.
Similarity search for the near neighbours of the approved drug donepezil. The search results are restricted to a molecular weight range of 300–800. Furthermore, only compounds extracted from the claims or description sections, or from images, are retrieved.
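The kind of filtered similarity search described in this caption can be sketched as follows. This is a minimal illustration, not the SureChEMBL implementation: the Tanimoto coefficient, the 0.7 threshold, the toy fingerprints (sets of bit indices) and the `Compound` record are all assumptions made for the example. Only the molecular weight range of 300–800 and the restriction to claims, description and images come from the caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    fingerprint: frozenset  # toy fingerprint: indices of "on" bits
    mol_weight: float
    sections: frozenset     # patent sections the compound was extracted from


def tanimoto(a, b):
    """Tanimoto coefficient of two bit sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_neighbours(query_fp, compounds, threshold=0.7,
                    mw_range=(300.0, 800.0),
                    allowed_sections=frozenset({"claims", "description", "image"})):
    """Return (name, score) pairs passing all filters, best match first."""
    low, high = mw_range
    hits = []
    for c in compounds:
        if not (low <= c.mol_weight <= high):
            continue  # outside the molecular weight window
        if not (c.sections & allowed_sections):
            continue  # not found in any permitted section
        score = tanimoto(query_fp, c.fingerprint)
        if score >= threshold:
            hits.append((c.name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy data, not real chemistry.
QUERY = frozenset({1, 2, 3, 4, 5})
LIBRARY = [
    Compound("A", frozenset({1, 2, 3, 4, 5, 6}), 400.0, frozenset({"claims"})),
    Compound("B", frozenset({1, 2, 3, 4}), 250.0, frozenset({"claims"})),
    Compound("C", frozenset({1, 2, 3, 4, 5}), 500.0, frozenset({"title"})),
    Compound("D", frozenset({1, 9}), 350.0, frozenset({"description"})),
]
HITS = near_neighbours(QUERY, LIBRARY)
```

With this toy library only compound A survives: B falls below the weight window, C was found only in a section that is not allowed, and D is too dissimilar to the query.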
Figure 4.
The results of a search can be either a list of documents (A) or compounds (B). The former is sorted in reverse chronological order and provides a preview of each document by means of patent ID, publication date, assignee, classification code(s), title and language. Moreover, for each document, members of the same patent family (i.e. a number of patent documents by the same inventors describing the same invention filed in multiple countries) across different patent authorities may be retrieved (listed on a dark background). Finally, the chemistry annotated in each document can be exported and downloaded. In the case of compound hits (B), the report card may be viewed for each hit (e.g. https://www.surechembl.org/chemical/SCHEMBL16354556 and Supplementary Figure S2). Additionally, users may select a number of these search hits and retrieve the patent documents associated with their selection.
Figure 5.
The export chemistry modal window allows users to filter compounds based on calculated physicochemical and related properties, simple counts and frequency of occurrence.
