Abstract
Artificial intelligence (AI) is an area of increasing scholarly and policy interest. To help researchers, policymakers, and the public, this paper describes a novel dataset identifying AI in over 13.2 million patents and pre-grant publications (PGPubs). The dataset, called the Artificial Intelligence Patent Dataset (AIPD), was constructed using machine learning models for each of eight AI component technologies covering areas such as natural language processing, AI hardware, and machine learning. The AIPD contains two data files: one identifying the patents and PGPubs predicted to contain AI, and a second containing the patent documents used to train the machine learning classification models. We also present several evaluation metrics based on manual review by patent examiners with focused expertise in AI, and show that our machine learning approach achieves state-of-the-art performance relative to existing alternatives in the literature. We believe releasing this dataset will strengthen policy formulation, encourage additional empirical work, and provide researchers with a common base for building empirical knowledge on the determinants and impacts of AI invention.
Notes
CISTP (2018) includes in its analysis Derwent Innovation ThemeScapes, which use statistical and textual analysis (31); CIPO leverages ML to clean data, such as inventor names (47).
For example, see the National Security Commission on Artificial Intelligence (NSCAI) report at https://www.nscai.gov/2021-final-report/.
561 U.S. 593, 130 S. Ct. 3218.
Other definitions of AI are useful for AI policy making and operational processes at the USPTO. Our definition of AI is not the official definition used by the USPTO.
All but 5224 of the Phase 2 documents (0.34%) were published in 2019 and 2020. See Appendix D in Supplementary Information for details.
See MPEP § 608.01(b).
MPEP § 608.01(k); see also §§ 608.01(i)-(o).
The publication of patent applications as PGPubs began with the American Inventors Protection Act (AIPA), enacted November 29, 1999.
We did not use AppFT for the PGPub abstract text during Phase 1 due to internal resource constraints at the time of processing the data. However, Google BigQuery processes and stores the original AppFT (and PatFT) abstract text in tabular format. The abstract text is also available for download at www.patentsview.org, an open data platform with parsed and value-added USPTO patent data.
The claims of a patent application may change during its examination to address rejections over the prior art, other rejections, and informalities as made by the patent examiner; see MPEP § 706.
Abood and Feltenberger (2018) use word2vec text embedding (116–117). Additionally, our word2vec approach uses code from Persiyanov (2018); see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb.
See MPEP § 905.03(a) for a description of the CPC and its use.
A patent family is a group of patent applications and/or granted patents that share a common applicant/owner and share a similar inventive concept. We use the “national family” variety (see https://www.wipo.int/edocs/mdocs/aspac/en/wipo_ip_bkk_12/wipo_ip_bkk_12_www_238983.pdf).
See description at the USPTO Public Search Facility webpage: https://www.uspto.gov/learning-and-resources/support-centers/public-search-facility/public-search-facility and MPEP § 902.03(e).
Note that the seed set, L1 and L2 expansions, and anti-seed generation used the data from Phase 1; all model development and training used Phase 1 data. All the patent documents in the updated Phase 2 data are in the “remaining” set of documents.
The word2vec encoding used the continuous bag of words (CBOW) model with a window size of 10 for abstracts and 5 for claims. It also ignored any word that appeared fewer than 10 times in the respective text.
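The settings in this note can be illustrated with a minimal pure-Python sketch of the two preprocessing ideas involved: dropping words below the minimum count and forming CBOW context windows. The function names, the `WINDOW` dictionary, and the pre-tokenized input are assumptions made for illustration; this is not the code used to build the dataset, and actual word2vec implementations may differ in detail (for example, by sampling a smaller effective window). In gensim's `Word2Vec`, these settings would most likely map to `sg=0`, `window`, and `min_count`.

```python
from collections import Counter

# Settings reported in the note; the dictionary layout is an illustrative assumption.
WINDOW = {"abstract": 10, "claims": 5}
MIN_COUNT = 10  # words appearing fewer than 10 times are ignored


def build_vocab(docs, min_count=MIN_COUNT):
    """Return the set of words that appear at least min_count times across docs."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {word for word, count in counts.items() if count >= min_count}


def cbow_pairs(tokens, vocab, window):
    """Yield (context_words, target_word) pairs for one tokenized text.

    Out-of-vocabulary words are dropped before the windows are formed.
    """
    kept = [t for t in tokens if t in vocab]
    for i, target in enumerate(kept):
        left = kept[max(0, i - window):i]
        right = kept[i + 1:i + 1 + window]
        yield left + right, target


# Hypothetical toy corpus: "rare" occurs once and falls below the minimum count.
docs = [["neural", "network"] * 10 + ["rare"]]
vocab = build_vocab(docs)
pairs = list(cbow_pairs(["neural", "rare", "network"], vocab, WINDOW["claims"]))
```

In CBOW, the model predicts each target word from the surrounding context words, so a larger window for abstracts lets more of the surrounding text inform each prediction.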
See Feltenberger (2019) at https://github.com/google/patents-public-data/tree/master/models/landscaping.
The one exception is the computer vision classification model. The trained model was not properly saved, and we retrained it using the same underlying training data and code. Hence, the results are consistent with our original model trained in the Phase 1 analysis.
The pairs were patent examiners 1–2, 1–3, 1–4, 2–3, 2–4, and 3–4. Each pair reviewed 36 patent documents in the consolidated seed group (216 total), 36 patent documents in the consolidated anti-seed group (216 total), and 61 or 63 patent documents in the consolidated L1, L2, and remaining group (368 total).
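The counts in this note can be checked with a short arithmetic sketch. The variable names are illustrative assumptions; the per-pair counts and group totals are taken from the note, and the split between pairs reviewing 61 and 63 documents is only what those totals imply.

```python
from itertools import combinations

# The four reviewing examiners, labelled 1-4 as in the note.
examiners = [1, 2, 3, 4]
pairs = list(combinations(examiners, 2))

# Per-pair counts stated in the note.
SEED_PER_PAIR = 36
ANTI_SEED_PER_PAIR = 36
seed_total = SEED_PER_PAIR * len(pairs)
anti_seed_total = ANTI_SEED_PER_PAIR * len(pairs)

# Each pair reviewed 61 or 63 documents from the L1, L2, and remaining group,
# 368 in total. Find how many pairs must have reviewed 63 for that to hold.
OTHER_TOTAL = 368
pairs_with_63 = next(
    k for k in range(len(pairs) + 1)
    if 61 * (len(pairs) - k) + 63 * k == OTHER_TOTAL
)

# Sum of the three group totals given in the note.
grand_total = seed_total + anti_seed_total + OTHER_TOTAL
print(pairs, seed_total, anti_seed_total, pairs_with_63, grand_total)
```

Four examiners yield six distinct pairs, which reproduces the 216 documents in each of the seed and anti-seed groups. The 368 total is consistent with five pairs reviewing 61 documents and one pair reviewing 63.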
See discussion at https://www.scikit-yb.org/en/latest/api/classifier/threshold.html. Since a patent document is classified as “any AI” if any prediction from the eight component models is at or above the threshold, the largest prediction from all eight models drives the “any AI” determination.
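The "any AI" rule described in this note can be sketched as follows. The function names, the placeholder component keys, the scores, and the default threshold of 0.5 are all assumptions for illustration; the note itself does not state the threshold value.

```python
def any_ai(scores, threshold=0.5):
    """Return True if the largest component-model score is at or above threshold.

    `scores` maps a component name to that model's prediction for one document.
    The default threshold is an illustrative assumption.
    """
    return max(scores.values()) >= threshold


def any_ai_explicit(scores, threshold=0.5):
    """Equivalent form: True if any single component score meets the threshold."""
    return any(score >= threshold for score in scores.values())


# Hypothetical predictions for one document from eight component models.
# Keys named component_5 to component_8 are placeholders, not dataset field names.
doc_scores = {
    "machine_learning": 0.62,
    "natural_language_processing": 0.10,
    "ai_hardware": 0.05,
    "computer_vision": 0.31,
    "component_5": 0.02,
    "component_6": 0.08,
    "component_7": 0.01,
    "component_8": 0.12,
}

flag_default = any_ai(doc_scores)       # highest score 0.62 meets 0.5
flag_strict = any_ai(doc_scores, 0.7)   # highest score 0.62 is below 0.7
```

Because only the maximum matters, raising the threshold can flip a document's "any AI" flag only when its highest-scoring component falls below the new cutoff.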
The patent and PGPub numbers in our dataset are as they appear on the printed U.S. publications, except that special characters (e.g., commas and slashes) were removed.
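A minimal sketch of this normalization, useful when matching external records to the dataset, is shown below. The function name and the example numbers are hypothetical, and the sketch assumes that every non-alphanumeric character is removed, whereas the note names only commas and slashes as examples.

```python
import re


def normalize_pub_number(printed):
    """Strip special characters (commas, slashes, spaces, etc.) from a printed number.

    Assumption: all non-alphanumeric characters are dropped.
    """
    return re.sub(r"[^0-9A-Za-z]", "", printed)


# Illustrative inputs only; these are not drawn from the dataset.
patent_id = normalize_pub_number("10,000,000")
reissue_id = normalize_pub_number("RE45,678")
pgpub_id = normalize_pub_number("2019/0012345")
```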
With the exception of reissue patents, which would require information regarding application priority relationships.
References
Abood, A., & Feltenberger, D. (2018). Automated patent landscaping. Artificial Intelligence and Law, 26(2), 103–125.
Alderucci, D., Branstetter, L., Hovy, E., Runge, A., & Zolas, N. (2020). Quantifying the impact of AI on productivity and labor demand: Evidence from US census microdata. Mimeo.
Arora, A., Belenzon, S., & Patacconi, A. (2018). The decline of science in corporate R&D. Strategic Management Journal, 39(1), 3–32.
Arora, A., Belenzon, S., Patacconi, A., & Suh, J. (2020). The changing structure of American innovation: Some cautionary remarks for economic growth. Innovation Policy and the Economy, 20(1), 39–93.
Atack, J., Bateman, F., & Margo, R. A. (2008). Steam power, establishment size, and labor productivity growth in nineteenth century American manufacturing. Explorations in Economic History, 45(2), 185–198.
Baruffaldi, S., van Beuzekom, B., Dernis, H., Harhoff, D., Rao, N., Rosenfeld, D., & Squicciarini, M. (2020). Identifying and measuring developments in artificial intelligence: Making the impossible possible.
Basu, S., & Fernald, J. (2007). Information and communications technology as a general-purpose technology: Evidence from US industry data. German Economic Review, 8(2), 146–173.
Benassi, M., Grinza, E., & Rentocchini, F. (2019). The rush for patents in the fourth industrial revolution: An exploration of patenting activity at the European Patent Office.
Bresnahan, T. F., & Trajtenberg, M. (1995). General purpose technologies ‘Engines of growth’? Journal of Econometrics, 65(1), 83–108.
Chien, C., Halkowski, N., He, M., & Swartz, R. (2020). The impact of 101 on patent prosecution—Post guidance updates. 2020 Patently-O Patent Law Journal.
Choi, S., Lee, H., Park, E. L., & Choi, S. (2019). Deep patent landscaping model using transformer and graph embedding. arXiv preprint arXiv:1903.05823.
CIPO (Canadian Intellectual Property Office). (2020). Processing Artificial Intelligence: Highlighting the Canadian Patent Landscape. Gatineau, Quebec: Canadian Intellectual Property Office.
CISTP (China Institute for Science and Technology Policy at Tsinghua University). (2018). China AI development report. China Institute for Science and Technology Policy at Tsinghua University.
Cockburn, I. M., Henderson, R., & Stern, S. (2019). The impact of artificial intelligence on innovation. In A. Agrawal, J. Gans, & A. Goldfarb (Eds.), The economics of artificial intelligence: An agenda (pp. 115–146). University of Chicago Press.
Crafts, N. (2004). Steam as a general purpose technology: A growth accounting perspective. The Economic Journal, 114(495), 338–351.
Crafts, N., & Mills, T. C. (2004). Was 19th century British growth steam-powered?: The climacteric revisited. Explorations in Economic History, 41(2), 156–171.
Damioli, G., Van Roy, V., & Vertesy, D. (2021). The impact of artificial intelligence on labor productivity. Eurasian Business Review, 11(1), 1–25.
Felten, E., Raj, M., & Seamans, R. (2021). Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses. Strategic Management Journal.
Feltenberger, D. (2019). “Automated patent landscaping” github post (January 11). https://github.com/google/patents-public-data/tree/master/models/landscaping
Fujii, H., & Managi, S. (2018). Trends and priority shifts in artificial intelligence technology invention: A global patent analysis. Economic Analysis and Policy, 58, 60–69.
Furman, J., & Seamans, R. (2019). AI and the economy. Innovation Policy and the Economy, 19(1), 161–191.
Graham, S. J., Marco, A. C., & Miller, R. (2018). The USPTO patent examination research dataset: A window on patent processing. Journal of Economics and Management Strategy, 27(3), 554–578.
Harris, S., Trippe, A., Challis, D., & Swycher, N. (2020). Construction and evaluation of gold standards for patent classification—A case study on quantum computing. World Patent Information, 61, 101961.
Hartmann, P., & Henkel, J. (2020). The rise of corporate science in AI: Data as a strategic resource. Academy of Management Discoveries, 6(3), 359–381.
Jovanovic, B., & Rousseau, P. L. (2005). General purpose technologies. In Handbook of economic growth (Vol. 1, pp. 1181–1224). Amsterdam: Elsevier B.V.
JPO (Japan Patent Office). (2019). Recent trends in AI-related inventions—Report. Tokyo: Japan Patent Office.
Kesan, J., & Wang, R. (2020). Eligible subject matter at the patent office: An empirical study of the influence of Alice on patent examiners and patent applications. Minnesota Law Review, 105(2), 527.
Kim, S. (2005). Industrialization and urbanization: Did the steam engine contribute to the growth of cities in the United States? Explorations in Economic History, 42(4), 586–598.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
PatentsView. (2021). https://www.patentsview.org.
Persiyanov, D. (2018). “*2Vec File-based Training: API Tutorial.” (last commit September 14). https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb.
Raj, M., & Seamans, R. (2018). AI, labor, productivity, and the need for firm-level data. Economics of Artificial Intelligence, May 14.
Rosenberg, N., & Trajtenberg, M. (2004). A general-purpose technology at work: The Corliss steam engine in the late-nineteenth-century United States. The Journal of Economic History, 64(1), 61–99.
Spulber, D. F. (2015). How patents provide the foundation of the market for inventions. Journal of Competition Law and Economics, 11(2), 271–316.
Toole, A., & Pairolero, N. (2020). Adjusting to Alice: USPTO patent examination outcomes after Alice Corp. v. CLS Bank International. Alexandria, VA: United States Patent and Trademark Office.
Toole, A., Pairolero, N., Giczy, A., Forman, J., Pulliam, C., Such, M., Chaki, K., Orange, D., Thomas Homescu, A., Frumkin, K., Chen, Y. Y., Gonzales, V., Hannon, C., Melnick, S., Nilsson, E., & Rifkin, B. (2020b). Inventing AI: Tracing the diffusion of artificial intelligence with U.S. patents. (October). Alexandria, VA: United States Patent and Trademark Office.
Toole, A. A., Pairolero, N. A., Forman, J. Q., & Giczy, A. V. (2020a). The promise of machine learning for patent landscaping. Santa Clara High Technology Law Journal, 36, 433.
Trippe, A. (2015). Guidelines for preparing patent landscape reports. World Intellectual Property Organization.
UKIPO (United Kingdom Intellectual Property Office). (2019). Artificial intelligence—A worldwide overview of AI patents. Newport, UK: United Kingdom Intellectual Property Office.
USPTO (United States Patent and Trademark Office). (2017). Patent Eligible Subject Matter: Report on Views and Recommendations from the Public. Alexandria, Virginia: United States Patent and Trademark Office.
USPTO (United States Patent and Trademark Office). (2020). Manual of Patent Examining Procedure (MPEP), ninth edition, revision 10.2019 (last revised June). Available from https://www.uspto.gov/web/offices/pac/mpep/index.html.
USPTO (United States Patent and Trademark Office). (2021). Withdrawn patent numbers website (last updated June 1). Available from https://www.uspto.gov/patents/search/withdrawn-patent-numbers.
Webb, M., Short, N., Bloom, N., & Lerner, J. (2018). Some facts of high-tech patenting (No. w24793). National Bureau of Economic Research.
WIPO (World Intellectual Property Organization). (2019). WIPO Technology Trends 2019—Artificial Intelligence. Geneva, Switzerland: World Intellectual Property Organization.
Cite this article
Giczy, A.V., Pairolero, N.A. & Toole, A.A. Identifying artificial intelligence (AI) invention: a novel AI patent dataset. J Technol Transf 47, 476–505 (2022). https://doi.org/10.1007/s10961-021-09900-2