Slides from a keynote presentation at CSV Conf 2023 in Buenos Aires, Argentina. April 19th, 2023
Software impacts virtually all areas of research but has been a heavily undervalued contribution. Over the past decade alone, the research software landscape has changed dramatically. It is now substantially easier to start new software projects, find technical resources, and join a friendly community of practice. The research software engineer career track has also taken off and made it easier for many individuals to build careers in this field. However, several key challenges remain. Despite the growing recognition of research software, it is still challenging to demonstrate impact or find support for the maintenance of existing software. In this talk, I describe some ideas on how to uncover software that is driving research and construct knowledge graphs to ask questions about software use and sustainability. I also describe the various conditions necessary to turn nascent software projects into sustainable ecosystems.
- Since research software is poorly cited, it’s hard to get a good picture of the software used in research. While software bills of materials are technically easy to generate and will provide a lot of value, they are not the norm in research or publishing.
- One workaround is to extract scientific software entities from PDFs using tools like Grobid and the software mentions extractor. Running this extraction on a substantial collection of articles from a field where open source is widely used would make it possible to ask all kinds of interesting questions, such as which software is driving research in a certain area, where the opportunities and challenges are, and how the use of tools is changing over time.
- Many researchers write last-mile analysis code that never goes any further. Some of this code, especially the implementation of new methods, may see the light of day as prototype software. These are minimum viable prototypes, with a small test suite and documentation but not designed for speed or stability. The subset of prototypes that find product-market fit enter the research software infrastructure space and need to be sustained.
- One way for software projects to raise their visibility is to align roadmaps with adjacent tools (adjacent in the sense of hard dependencies or usage-based dependencies). This would reduce friction, allow for resource sharing, and raise visibility as a collection of tools (e.g. spatial data science, Tidyverse).
- Besides solving technical challenges by aligning with the local ecosystem, projects also need to be in alignment with the larger ecosystem (actors and institutions that enable the work).
- The definitions of software sustainability are clear, but a broader definition I provide is that “Software is sustainable as long as the people behind it have the resources to continue fulfilling its mission”.
- There are examples of widely used software that have run out of resources while dealing with an outdated stack. Rather than sustain those tools, the community can choose to replace them with something more modern and aligned with the needs of users (see the IRAF → Astropy example below). In other words, not everything needs to be sustained forever.
- At POSE training, we have identified 5 core areas that are necessary to sustain an OSE. These are org structure (the managing org that can guide future growth), governance (robust decision-making and collaboration management), business perspectives (managing hidden infrastructure costs and resources, which includes funding), security (technical and non-technical threats), and community.
- Using Nadia Eghbal’s taxonomy (toy, club, federation & stadium), it would be a good exercise to categorize your project to see how best to engage your audience in meaningful ways.
- Once projects have found product-market fit, there is little in the way of long-term support (funding or otherwise). COPs (and Ecosystem-level entities) can use tools like CHAOSS to surface certain types of issues (low maintainer growth, time to PR close as a way to engage new contributors) and address those before it is too late. Maintainer burnout is another growing problem that needs attention.
- Security issues are important. While we have not seen major security issues in scientific open source (compared to the larger OSS community), it is still important to stay on top of CVEs and use CI/CD more extensively. It would also be a good idea to keep an eye on non-technical threats, such as bad actors and poor governance.
- If all else fails and a project needs to end, it must be done responsibly. This includes notifying all stakeholders (downstream dependencies, users, trainers), providing pointers to comparable alternatives, archiving all code to support reproducibility efforts, and offering enough lead time (See the r-spatial example below).
- The calls to action are:
- If you are a developer, find ways to align with your local ecosystem to coordinate roadmaps and resources, and raise your visibility.
- If you’re a COP, find ways to address maintainer burnout, provide governance templates, document managing org options, etc.
- Lastly, folks operating at the level of an ecosystem (funders, foundations, training partners, infrastructure providers) can also pick and address one or more of these issues at scale.
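The mention-mining idea in the summary above can be illustrated with a small sketch. It assumes mentions have already been extracted from PDFs (for example with Grobid and the software mentions extractor) into simple records. The record fields (`doi`, `year`, `software`), the sample values, and the function names are hypothetical and do not reflect the output format of either tool.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical, hand-written records: one per software mention in an article.
# Real extractor output will look different and needs its own parsing step.
MENTIONS = [
    {"doi": "10.0000/example.1", "year": 2019, "software": "IRAF"},
    {"doi": "10.0000/example.1", "year": 2019, "software": "iraf "},
    {"doi": "10.0000/example.2", "year": 2019, "software": "IRAF"},
    {"doi": "10.0000/example.3", "year": 2022, "software": "Astropy"},
    {"doi": "10.0000/example.3", "year": 2022, "software": "NumPy"},
    {"doi": "10.0000/example.4", "year": 2022, "software": "astropy"},
]


def normalize(name):
    """Very rough name normalization (assumption: case and whitespace only)."""
    return name.strip().lower()


def articles_per_tool_by_year(records):
    """Count distinct articles mentioning each tool, grouped by year."""
    seen = set()
    counts = defaultdict(Counter)
    for rec in records:
        tool = normalize(rec["software"])
        key = (rec["doi"], tool)
        if key in seen:
            continue  # count each tool at most once per article
        seen.add(key)
        counts[rec["year"]][tool] += 1
    return counts


def co_mentions(records):
    """Count how often two tools appear in the same article (graph edges)."""
    tools_by_article = defaultdict(set)
    for rec in records:
        tools_by_article[rec["doi"]].add(normalize(rec["software"]))
    edges = Counter()
    for tools in tools_by_article.values():
        for pair in combinations(sorted(tools), 2):
            edges[pair] += 1
    return edges


if __name__ == "__main__":
    for year, tools in sorted(articles_per_tool_by_year(MENTIONS).items()):
        print(year, dict(tools))
    print(dict(co_mentions(MENTIONS)))
```

The per-year counts speak to how tool use changes over time, and the co-mention pairs are one simple way to seed the edges of a knowledge graph. A real pipeline would need much more careful name disambiguation than the lowercase-and-strip step used here.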
- The Journal of Open Source Software
- Paper about the design of The Journal of Open Source Software (JOSS): Journal of Open Source Software (JOSS): design and first-year review
- Software Bill of Materials
- Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products - Dan Katz and Arfon Smith
- Grobid
- Softcite dataset: A dataset of software mentions in biomedical and economic research publications - Caifan Du, Johanna Cohoon, Patrice Lopez, James Howison
- Mining Software Entities in Scientific Literature: Document-level NER for an Extremely Imbalance and Large-scale Task - Patrice Lopez, Caifan Du, Johanna Cohoon, Karthik Ram, James Howison
- Pathways Enabling Open Source Ecosystems training program, part of the NSF’s TIP directorate POSE program
- Software Sustainability: Research and Practice from a Software Architecture Viewpoint (PDF)
- Report: Sustainability in Research-Driven Open Source Software - Danielle Robinson and Joe Hand
- Perry Greenfield’s PyData Keynote on How Python Found its way Into Astronomy covers some of the transition from IRAF → AstroPy
- Working in Public: The Making and Maintenance of Open Source Software 📙
- Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure. This is an older report and has some of the ideas from the book in case you can’t get a hold of that one.
- CHAOSS Community stands for Community Health Analysis for Open-Source Software.
- Various links from The rOpenSci Project
- r-universe
- CRAN to Git: How to use r-universe to search across package ecosystems.
- Strategy for Culture Change. Nicholas Tierney and I adapted this concept in a paper about data sharing. Here’s a direct link to our figure.
- On the topic of software being retired, here is a series of posts (part 1, part 2, part 3) about rgdal and other tools in the R spatial suite being retired in fall 2023 because the core maintainer is retiring and the stack has been replaced by something better. Note all the steps being taken to retire this responsibly.
“We describe where their functionality will go, what package maintainers can or should do, and which steps we will take to minimize the impact on dependent packages and on reproducibility in general.”
Ram, Karthik. (2023, April 12). How to enable and sustain thriving Open Source Ecosystems (OSE). Zenodo. https://doi.org/10.5281/zenodo.7822917
This talk was greatly improved by discussions with Arfon Smith, James Howison, Sean Goggins, Patrice Lopez, and Abby Cabunoc Mayes.
Questions or comments are welcome at karthik dot ram at gmail.