Identifying artificial intelligence (AI) invention: a novel AI patent dataset

The Journal of Technology Transfer

Abstract

Artificial intelligence (AI) is an area of increasing scholarly and policy interest. To help researchers, policymakers, and the public, this paper describes a novel dataset identifying AI in over 13.2 million patents and pre-grant publications (PGPubs). The dataset, called the Artificial Intelligence Patent Dataset (AIPD), was constructed using machine learning models for each of eight AI component technologies covering areas such as natural language processing, AI hardware, and machine learning. The AIPD contains two data files: one identifying the patents and PGPubs predicted to contain AI, and a second containing the patent documents used to train the machine learning classification models. We also present several evaluation metrics based on manual review by patent examiners with focused expertise in AI, and show that our machine learning approach achieves state-of-the-art performance relative to existing alternatives in the literature. We believe releasing this dataset will strengthen policy formulation, encourage additional empirical work, and provide researchers with a common base for building empirical knowledge on the determinants and impacts of AI invention.

Notes

  1. CISTP (2018) includes in its analysis Derwent Innovation ThemeScapes, which use statistical and textual analysis (31); CIPO leverages ML to clean data, such as inventor names (47).

  2. For example, see the National Security Commission on Artificial Intelligence (NSCAI) report at https://www.nscai.gov/2021-final-report/.

  3. 561 U.S. 593, 130 S. Ct. 3218.

  4. Other definitions of AI are useful for AI policy making and operational processes at the USPTO. Our definition of AI is not the official definition used by the USPTO.

  5. All but 5224 of the Phase 2 documents (0.34%) were published in 2019 and 2020. See Appendix D in Supplementary Information for details.

  6. See MPEP § 608.01(b).

  7. MPEP § 608.01(k); see also §§ 608.01(i)-(o).

  8. See https://bulkdata.uspto.gov/.

  9. The publication of patent applications as PGPubs began with the American Inventors Protection Act (AIPA), enacted November 29, 1999.

  10. See https://cloud.google.com/bigquery.

  11. We did not use AppFT for the PGPub abstract text during Phase 1 due to internal resource constraints at the time of processing the data. However, Google BigQuery processes and stores the original AppFT (and PatFT) abstract text in tabular format. The abstract text is also available for download at www.patentsview.org, an open data platform with parsed and value-added USPTO patent data.

  12. The claims of a patent application may change during its examination to address rejections over the prior art, other rejections, and informalities as made by the patent examiner; see MPEP § 706.

  13. Abood and Feltenberger (2018) use word2vec text embedding (116–117). Additionally, our word2vec approach uses code from Persiyanov (2018), see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb.

  14. See https://www.uspto.gov/ip-policy/economic-research/research-datasets.

  15. See https://bulkdata.uspto.gov/data/patent/classification/cpc/.

  16. See MPEP § 905.03(a) for a description of the CPC and its use.

  17. A patent family is a group of patent applications and/or granted patents that share a common applicant/owner and share a similar inventive concept. We use the “national family” variety (see https://www.wipo.int/edocs/mdocs/aspac/en/wipo_ip_bkk_12/wipo_ip_bkk_12_www_238983.pdf).

  18. See description at the USPTO Public Search Facility webpage: https://www.uspto.gov/learning-and-resources/support-centers/public-search-facility/public-search-facility and MPEP § 902.03(e).

  19. See https://clarivate.com/derwent/solutions/derwent-world-patent-index-dwpi/.

  20. Note the seed set, L1 and L2 expansions, and anti-seed generation used the data from Phase 1—all model development and training used Phase 1 data. All the patent documents in the updated Phase 2 data are in the “remaining” set of documents.

  21. The word2vec encoding used the continuous bag of words (CBOW) model with a window size of 10 for abstracts and 5 for claims. It also ignored any word that appeared fewer than 10 times in the respective text.

  22. See Feltenberger (2019) at https://github.com/google/patents-public-data/tree/master/models/landscaping.

  23. The one exception is the computer vision classification model. The trained model was not properly saved, and we retrained it using the same underlying training data and code. Hence, the results are consistent with our original model trained in the Phase 1 analysis.

  24. The pairs were patent examiners 1–2, 1–3, 1–4, 2–3, 2–4, and 3–4. Each pair reviewed 36 patent documents in the consolidated seed group (216 total), 36 patent documents in the consolidated anti-seed group (216 total), and 61 or 63 patent documents in the consolidated L1, L2, and remaining group (368 total).

  25. See discussion at https://www.scikit-yb.org/en/latest/api/classifier/threshold.html. Since a patent document is classified as “any AI” if any prediction from the eight component models is at or above the threshold, the largest prediction from all eight models drives the “any AI” determination.

  26. The patent and PGPub numbers in our dataset are as they appear on the printed U.S. publications, except that special characters (e.g., commas and slashes) were removed.

  27. With the exception of reissue patents, which would require information regarding application priority relationships.
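
The preprocessing settings in note 21 can be illustrated with a small sketch. This is not the authors' code, which builds on gensim (note 13). It only shows how a minimum word count and a symmetric context window determine the (context, target) pairs that a CBOW model trains on. The names `build_vocab`, `cbow_pairs`, and `SETTINGS`, the toy corpus, and the shrunken toy thresholds are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, n in counts.items() if n >= min_count}

def cbow_pairs(sentence, vocab, window):
    """Yield (context_words, target_word) pairs for one tokenized sentence.

    In this sketch, out-of-vocabulary words are dropped before windowing,
    so the window spans the surviving tokens only.
    """
    tokens = [w for w in sentence if w in vocab]
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target

# Settings reported in note 21 (window per text type, minimum word count).
SETTINGS = {
    "abstracts": {"window": 10, "min_count": 10},
    "claims": {"window": 5, "min_count": 10},
}

# Hypothetical toy corpus with shrunken thresholds so the result is readable.
toy = [
    ["neural", "network", "trained", "on", "images"],
    ["neural", "network", "for", "speech"],
]
vocab = build_vocab(toy, min_count=2)
pairs = list(cbow_pairs(toy[0], vocab, window=1))
```

With the toy thresholds, only "neural" and "network" survive the count filter, so each becomes the sole context word for the other.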
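
The review counts in note 24 can be checked with a few lines of arithmetic. The sketch below enumerates the six examiner pairs and confirms the stated group totals. Note 24 does not say how the 61-document and 63-document workloads were split across pairs, so the code infers the only split consistent with the stated total of 368. The variable names are hypothetical, and the grand total is derived here rather than quoted from the paper.

```python
from itertools import combinations

examiners = [1, 2, 3, 4]
pairs = list(combinations(examiners, 2))  # all unordered pairs of examiners

# Each pair reviewed 36 seed and 36 anti-seed documents.
seed_total = 36 * len(pairs)
anti_seed_total = 36 * len(pairs)

# Find how many pairs reviewed 61 documents (a) and how many reviewed 63 (b)
# so that 61*a + 63*b equals the stated total of 368, with a + b = len(pairs).
splits = [
    (a, len(pairs) - a)
    for a in range(len(pairs) + 1)
    if 61 * a + 63 * (len(pairs) - a) == 368
]

# Derived figure (not stated in the note): documents reviewed across all groups.
grand_total = seed_total + anti_seed_total + 368
```

Under these figures, five pairs reviewed 61 documents and one pair reviewed 63 in the consolidated L1, L2, and remaining group.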
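
The "any AI" rule in note 25 amounts to comparing the largest of the eight component predictions with the threshold. A minimal sketch, assuming each document carries eight scores between 0 and 1: the function name `any_ai`, the example scores, and the threshold value of 0.5 are illustrative assumptions, not values taken from the dataset.

```python
def any_ai(component_scores, threshold):
    """Return True if at least one component score is at or above threshold.

    Equivalent to checking whether the largest score meets the threshold.
    """
    return max(component_scores) >= threshold

# Hypothetical predictions from the eight component models for two documents.
doc_a = [0.02, 0.10, 0.05, 0.50, 0.01, 0.03, 0.08, 0.04]
doc_b = [0.02, 0.10, 0.05, 0.49, 0.01, 0.03, 0.08, 0.04]

threshold = 0.5  # illustrative value only
flag_a = any_ai(doc_a, threshold)
flag_b = any_ai(doc_b, threshold)

# The max-based rule agrees with the "any prediction" wording of the note.
same_as_any = flag_a == any(score >= threshold for score in doc_a)
```

Because only the maximum matters, raising or lowering the threshold changes the "any AI" flag for a document exactly when the threshold crosses that document's highest component score.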
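
When matching outside records to the dataset, the number format in note 26 matters. A minimal sketch of that normalization, assuming commas and slashes are the only special characters to strip (the note gives them as examples, so the real list may be longer). The function name `normalize_publication_number` and the sample numbers are hypothetical.

```python
def normalize_publication_number(printed):
    """Drop commas and slashes from a printed patent or PGPub number."""
    return printed.replace(",", "").replace("/", "")

# Hypothetical printed numbers, for illustration only.
patent_key = normalize_publication_number("10,123,456")
pgpub_key = normalize_publication_number("2019/0012345")
```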

References

  • Abood, A., & Feltenberger, D. (2018). Automated patent landscaping. Artificial Intelligence and Law, 26(2), 103–125.

  • Alderucci, D., Branstetter, L., Hovy, E., Runge, A., & Zolas, N. (2020). Quantifying the impact of AI on productivity and labor demand: Evidence from US census microdata. Mimeo.

  • Arora, A., Belenzon, S., & Patacconi, A. (2018). The decline of science in corporate R&D. Strategic Management Journal, 39(1), 3–32.

  • Arora, A., Belenzon, S., Patacconi, A., & Suh, J. (2020). The changing structure of American innovation: Some cautionary remarks for economic growth. Innovation Policy and the Economy, 20(1), 39–93.

  • Atack, J., Bateman, F., & Margo, R. A. (2008). Steam power, establishment size, and labor productivity growth in nineteenth century American manufacturing. Explorations in Economic History, 45(2), 185–198.

  • Baruffaldi, S., van Beuzekom, B., Dernis, H., Harhoff, D., Rao, N., Rosenfeld, D., & Squicciarini, M. (2020). Identifying and measuring developments in artificial intelligence: Making the impossible possible.

  • Basu, S., & Fernald, J. (2007). Information and communications technology as a general-purpose technology: Evidence from US industry data. German Economic Review, 8(2), 146–173.

  • Benassi, M., Grinza, E., & Rentocchini, F. (2019). The rush for patents in the fourth industrial revolution: An exploration of patenting activity at the European Patent Office.

  • Bresnahan, T. F., & Trajtenberg, M. (1995). General purpose technologies ‘Engines of growth’? Journal of Econometrics, 65(1), 83–108.

  • Chien, C., Halkowski, N., He, M., & Swartz, R. (2020). The impact of 101 on patent prosecution—Post guidance updates. 2020 Patently-O Patent Law Journal.

  • Choi, S., Lee, H., Park, E. L., & Choi, S. (2019). Deep patent landscaping model using transformer and graph embedding. arXiv preprint arXiv:1903.05823.

  • CIPO (Canadian Intellectual Property Office). (2020). Processing Artificial Intelligence: Highlighting the Canadian Patent Landscape. Gatineau, Quebec: Canadian Intellectual Property Office.

  • CISTP (China Institute for Science and Technology Policy at Tsinghua University). (2018). China AI development report. China Institute for Science and Technology Policy at Tsinghua University.

  • Cockburn, I. M., Henderson, R., & Stern, S. (2019). The impact of artificial intelligence on innovation. In A. Agrawal, J. Gans, & A. Goldfarb (Eds.), The economics of artificial intelligence: An agenda (pp. 115–146). University of Chicago Press.

  • Crafts, N. (2004). Steam as a general purpose technology: A growth accounting perspective. The Economic Journal, 114(495), 338–351.

  • Crafts, N., & Mills, T. C. (2004). Was 19th century British growth steam-powered?: The climacteric revisited. Explorations in Economic History, 41(2), 156–171.

  • Damioli, G., Van Roy, V., & Vertesy, D. (2021). The impact of artificial intelligence on labor productivity. Eurasian Business Review, 11(1), 1–25.

  • Felten, E., Raj, M., & Seamans, R. (2021). Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses. Strategic Management Journal.

  • Feltenberger, D. (2019). “Automated patent landscaping” github post (January 11). https://github.com/google/patents-public-data/tree/master/models/landscaping

  • Fujii, H., & Managi, S. (2018). Trends and priority shifts in artificial intelligence technology invention: A global patent analysis. Economic Analysis and Policy, 58, 60–69.

  • Furman, J., & Seamans, R. (2019). AI and the economy. Innovation Policy and the Economy, 19(1), 161–191.

  • Graham, S. J., Marco, A. C., & Miller, R. (2018). The USPTO patent examination research dataset: A window on patent processing. Journal of Economics and Management Strategy, 27(3), 554–578.

  • Harris, S., Trippe, A., Challis, D., & Swycher, N. (2020). Construction and evaluation of gold standards for patent classification—A case study on quantum computing. World Patent Information, 61, 101961.

  • Hartmann, P., & Henkel, J. (2020). The rise of corporate science in AI: Data as a strategic resource. Academy of Management Discoveries, 6(3), 359–381.

  • Jovanovic, B., & Rousseau, P. L. (2005). General purpose technologies. In Handbook of economic growth (Vol. 1, pp. 1181–1224). Amsterdam: Elsevier B.V.

  • JPO (Japan Patent Office). (2019). Recent trends in AI-related inventions—Report. Tokyo: Japan Patent Office.

  • Kesan, J., & Wang, R. (2020). Eligible subject matter at the patent office: An empirical study of the influence of Alice on patent examiners and patent applications. Minnesota Law Review, 105(2), 527.

  • Kim, S. (2005). Industrialization and urbanization: Did the steam engine contribute to the growth of cities in the United States? Explorations in Economic History, 42(4), 586–598.

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

  • PatentsView. (2021). https://www.patentsview.org.

  • Persiyanov, D. (2018). “*2Vec File-based Training: API Tutorial.” (last commit September 14). https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb.

  • Raj, M., & Seamans, R. (2018). AI, labor, productivity, and the need for firm-level data. Economics of Artificial Intelligence, May 14.

  • Rosenberg, N., & Trajtenberg, M. (2004). A general-purpose technology at work: The Corliss steam engine in the late-nineteenth-century United States. The Journal of Economic History, 64(1), 61–99.

  • Spulber, D. F. (2015). How patents provide the foundation of the market for inventions. Journal of Competition Law and Economics, 11(2), 271–316.

  • Toole, A., & Pairolero, N. (2020). Adjusting to Alice: USPTO patent examination outcomes after Alice Corp. v. CLS Bank International. Alexandria, VA: United States Patent and Trademark Office.

  • Toole, A., Pairolero, N., Giczy, A., Forman, J., Pulliam, C., Such, M., Chaki, K., Orange, D., Thomas Homescu, A., Frumkin, K., Chen, Y. Y., Gonzales, V., Hannon, C., Melnick, S., Nilsson, E., & Rifkin, B. (2020b). Inventing AI: Tracing the diffusion of artificial intelligence with U.S. patents. (October). Alexandria, VA: United States Patent and Trademark Office.

  • Toole, A. A., Pairolero, N. A., Forman, J. Q., & Giczy, A. V. (2020a). The promise of machine learning for patent landscaping. Santa Clara High Technology LJ, 36, 433.

  • Trippe, A. (2015). Guidelines for preparing patent landscape reports. World Intellectual Property Organization.

  • UKIPO (United Kingdom Intellectual Property Office). (2019). Artificial intelligence—A worldwide overview of AI patents. Newport, UK: United Kingdom Intellectual Property Office.

  • USPTO (United States Patent and Trademark Office). (2017). Patent Eligible Subject Matter: Report on Views and Recommendations from the Public. Alexandria, Virginia: United States Patent and Trademark Office.

  • USPTO (United States Patent and Trademark Office). (2020). Manual of Patent Examining Procedure (MPEP), ninth edition, revision 10.2019 (last revised June). Available from https://www.uspto.gov/web/offices/pac/mpep/index.html.

  • USPTO (United States Patent and Trademark Office). (2021). Withdrawn patent numbers website (last updated June 1). Available from https://www.uspto.gov/patents/search/withdrawn-patent-numbers.

  • Webb, M., Short, N., Bloom, N., & Lerner, J. (2018). Some facts of high-tech patenting (No. w24793). National Bureau of Economic Research.

  • WIPO (World Intellectual Property Organization). (2019). WIPO Technology Trends 2019—Artificial Intelligence. Geneva, Switzerland: World Intellectual Property Organization.

Author information

Corresponding author

Correspondence to Andrew A. Toole.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 41 kb)

About this article

Cite this article

Giczy, A.V., Pairolero, N.A. & Toole, A.A. Identifying artificial intelligence (AI) invention: a novel AI patent dataset. J Technol Transf 47, 476–505 (2022). https://doi.org/10.1007/s10961-021-09900-2

  • DOI: https://doi.org/10.1007/s10961-021-09900-2
