External Library Management
At LinkedIn, when our engineers create software, there is often a need to leverage some of the great work done by the open source community outside of LinkedIn. In our continuous delivery parlance, we refer to these assets as External Libraries. It's great to be able to stand on the shoulders of the community and use their wonderful work to build quality products. But this process has challenges, especially when you are building a predictable and reliable experience for your users. Last year, the removal of left-pad from the npm registry caused developers around the world to suffer an unnecessary software outage. There are many other stories of published software that puts an entire organization at risk because of buggy code, security holes, or licensing challenges.
Another important aspect of maintaining a high quality CI/CD pipeline is ensuring that each build is reproducible. Having reproducible builds removes entropy from the build process, allowing us to create the same piece of software from any given point in time. This requires not only a snapshot of the code needed to create an artifact, but also the exact dependencies, including external ones, used during the build process.
Handling external dependencies
External dependencies can be dealt with in a couple of different ways: include the library of interest directly in the source repository, or download it on demand during the build process. Both of these approaches are problematic.
Checking external binary dependencies into the source repository can cause repository size to increase dramatically, causing performance problems during checkouts and commits. Binary dependencies do not generate useful diffs between versions. In short, version control systems were not built to handle many large binary files that are frequently updated. Checking in dependencies can also make it difficult to determine the source and version of each dependency, and that lack of information makes auditing difficult.
The challenge with downloading during the build process is that one does not have control over the repositories the dependencies are pulled from. Libraries can be removed, causing unexpected build failures, or worse, they can be compromised, creating legal and security risks.
At LinkedIn, we tackle this challenge by importing external library dependencies into our own repository through the External Library Request (ELR) process. During this process, we import and host each requested dependency, all of its metadata, and all of its transitive dependencies. The ELR process gives us the ability to audit each dependency and removes our reliance on external repositories. Having all our dependencies in one place also makes it easy to run security tools and license checks on an ongoing basis, thus preventing any legal or security risks that may manifest themselves during our development process. It is instrumental in enabling our ability to generate reproducible builds.
The ELR process
The External Library Request (ELR) process allows LinkedIn engineers to bring in open source libraries in a safe and reliable manner as a part of their software development process. The goal of the ELR process is to manage external libraries for all of the major languages in use at LinkedIn. To date we have the ability to manage Java, JavaScript, C++, Python, and iOS libraries.
The ELR process consists of a fairly straightforward list of steps:
- The process starts with the user requesting the libraries of interest using unique identifiers that unambiguously identify each artifact.
- For JavaScript libraries we support npm packages. The user needs to specify the <package name> and the <version> of the npm package.
- For Java libraries, users need to provide the Maven coordinates of the package, specified as <groupId:artifactId:version> (see the parsing sketch after this list).
- For Python libraries, the user needs to provide the PyPI <package name> and <version> of the package to install.
- For iOS libraries we support CocoaPods. The user needs to specify the <pod name> and the <version> to install.
- The next step consists of locating the artifact to be imported. Java libraries typically come from JCenter, Python libraries come from PyPI, iOS pods come from GitHub, and JavaScript libraries come from the global npm registry.
- Using the metadata retrieved for the specific package, we compute the full dependency tree of the requested library and determine what needs to be imported into our repository. Some of the libraries may already be present in the repository.
- Next, it is necessary to ensure that the "terms of use" for all the libraries to be imported satisfy the legal requirements of the organization.
- Once we determine that there are no legal concerns about the library, the import process may begin. This part of the process consists of following the dependency tree, downloading the relevant libraries, and creating the appropriate metadata files for our repository.
- The final step is to upload the libraries and their metadata to our internal repository.
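To make the identifier formats concrete, here is a minimal sketch of how a Maven coordinate request might be parsed. The Coordinate class and function names are illustrative only, not part of the actual ELR service.

```kotlin
// A minimal sketch of parsing a Maven coordinate request. The Coordinate
// class and function names are hypothetical, not LinkedIn's actual API.
data class Coordinate(val group: String, val artifact: String, val version: String)

fun parseMavenCoordinate(spec: String): Coordinate {
    val parts = spec.split(":")
    require(parts.size == 3) { "Expected <groupId:artifactId:version>, got: $spec" }
    return Coordinate(parts[0], parts[1], parts[2])
}

fun main() {
    // e.g., a request for Guava 19.0
    println(parseMavenCoordinate("com.google.guava:guava:19.0"))
}
```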
In the rest of this article, we will mostly focus on the handling of external Java libraries. The process for handling other languages is generally similar, even though specific details may vary.
Automating the workflow
The ELR service is a self-service web portal that automates this entire process, giving users a seamless way to request the external libraries they need to incorporate into their products. The workflow is shown below.
1. Request submission: The user uses the web portal to submit a request for a library to be imported.
2. Resolution phase: The metadata is fetched and the dependency tree is calculated for the given library, then cross-checked against existing libraries. A complete list of libraries to import is created.
3. Validation phase: Several validation tools are run on the list of libraries and their metadata.
- Legal check: The metadata is scanned to ensure that the software licenses of the requested libraries are compliant with LinkedIn's legal requirements.
- Security check: Security checks are run against the downloaded libraries to detect any potential vulnerabilities.
- Well-formedness check: Each dependency must map to a specific version or version range of a given library. For example, in npm we block libraries that depend on versions at a specific source control revision (a sketch of such a check follows this list). This is needed so we don't create dependencies on remote source control repositories; in our experience, depending on an external repository introduces uncertainty into the build system by creating a dependency on an external service.
4. Manual intervention: If any of the validation steps fail, the concern needs to be manually resolved with the appropriate stakeholders.
5. Confirmation phase: Once the set of libraries has passed validation, the requester is notified and the libraries are uploaded to our internal repository.
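As an illustration of the well-formedness check above, the sketch below flags npm dependency specifiers that point at a source control revision instead of a registry version or semver range. The heuristics are simplified for illustration and are not the actual rules the ELR service applies.

```kotlin
// Hedged sketch of a well-formedness check: reject npm dependency
// specifiers that reference source control rather than a registry
// version or semver range. Heuristics simplified for illustration.
fun isSourceControlSpec(spec: String): Boolean =
    spec.startsWith("git+") ||
    spec.startsWith("git://") ||
    spec.contains("github.com") ||
    Regex("[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+(#.+)?").matches(spec) // GitHub shorthand, e.g. user/repo#sha

fun main() {
    val deps = mapOf(
        "lodash" to "^4.17.0",                           // ok: semver range
        "left-pad" to "git+https://github.com/x/y.git"   // blocked: source control revision
    )
    for ((name, spec) in deps) {
        val verdict = if (isSourceControlSpec(spec)) "blocked" else "ok"
        println("$name@$spec -> $verdict")
    }
}
```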
Resolving dependencies & Locating libraries to download
Any given language or platform can use multiple build systems and repositories from which dependencies are downloaded. Our process is optimized for Gradle, the primary build system in use at LinkedIn, which we use to build almost all of our software products.
When building projects for the JVM, the most popular external repository to include in a Gradle build script is JCenter, the largest Java repository in terms of number of libraries. We also include other repositories commonly used by developers at LinkedIn, like Google's Maven repository. Developers can request libraries from a list of repositories and have them published into our internal repository.
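For reference, this is the kind of repository configuration the text describes, sketched as a Gradle Kotlin DSL fragment; the internal repository URL is a placeholder, not a real LinkedIn endpoint.

```kotlin
// build.gradle.kts fragment (a sketch; the internal URL is a placeholder).
repositories {
    jcenter()  // the largest Java repository by number of libraries
    google()   // Google's Maven repository
    // At LinkedIn, builds resolve against the internal repository instead:
    // maven { url = uri("https://artifacts.example.com/internal") }
}
```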
Java dependencies in these repositories are usually described by Maven POM files, which include metadata such as the library's own dependencies. When a library is requested, we derive the dependency tree by traversing the library's dependencies recursively to find the complete list of transitive dependencies. We then cross-reference the libraries in this tree against the list of libraries already available in our internal repository; any transitive dependencies that are missing are imported into our internal repository. We use a custom Gradle plugin to perform these actions.
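The sketch below shows the shape of that recursive traversal, assuming direct access to an external repository's POM files. It deliberately ignores parent POMs, ${property} interpolation, and dependencyManagement, all of which a real implementation such as our Gradle plugin must handle; the repository URL and the isInternal() check are placeholders.

```kotlin
import java.net.URL
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Sketch of recursive dependency-tree resolution over POM metadata.
// Simplifications: no parent POMs, no ${property} interpolation, no
// dependencyManagement handling. REPO and isInternal() are placeholders.
const val REPO = "https://jcenter.bintray.com"

fun pomUrl(group: String, artifact: String, version: String) =
    "$REPO/${group.replace('.', '/')}/$artifact/$version/$artifact-$version.pom"

fun isInternal(id: String) = false  // placeholder: query the internal repository

fun childText(e: Element, tag: String): String? =
    e.getElementsByTagName(tag).let { if (it.length > 0) it.item(0).textContent.trim() else null }

fun resolve(group: String, artifact: String, version: String,
            seen: MutableSet<String> = mutableSetOf()): Set<String> {
    val id = "$group:$artifact:$version"
    if (!seen.add(id) || isInternal(id)) return seen  // already visited or already hosted
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(URL(pomUrl(group, artifact, version)).openStream())
    val deps = doc.getElementsByTagName("dependency")
    for (i in 0 until deps.length) {
        val d = deps.item(i) as Element
        val g = childText(d, "groupId") ?: continue
        val a = childText(d, "artifactId") ?: continue
        val v = childText(d, "version") ?: continue  // skip versions inherited from parents
        val scope = childText(d, "scope") ?: "compile"
        if (scope == "compile" || scope == "runtime") resolve(g, a, v, seen)
    }
    return seen
}

fun main() {
    // e.g., compute what importing Guava 19.0 would pull in
    resolve("com.google.guava", "guava", "19.0").forEach(::println)
}
```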
When building projects for npm, we follow a similar approach. The npm registry hosts libraries along with their corresponding metadata, which is stored in a JSON file called package.json. We resolve the dependency tree using the npm command line tool.
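A minimal sketch of that CLI-based resolution: npm's standard `npm view <name>@<version> dependencies --json` command prints a package's declared dependencies as JSON. Error handling and JSON parsing are omitted for brevity.

```kotlin
// Sketch: shell out to the npm CLI to read a package's declared
// dependencies. `npm view <name>@<version> dependencies --json` is a
// standard npm command; error handling is omitted for brevity.
fun npmDependencies(name: String, version: String): String {
    val proc = ProcessBuilder("npm", "view", "$name@$version", "dependencies", "--json")
        .redirectErrorStream(true)
        .start()
    return proc.inputStream.bufferedReader().readText()
}

fun main() {
    // e.g., inspect what express 4.15.0 depends on
    println(npmDependencies("express", "4.15.0"))
}
```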
Managing library metadata
At LinkedIn, we use Apache Ivy as the dependency specification system for sharing code between our many projects, largely to leverage its concept of configurations. Configurations allow you to map artifacts into different groups. Maven only allows for six scopes (compile, provided, runtime, test, system, and import); using Ivy lets us create custom configurations to store artifacts such as the source bundle, documentation, and test dependencies.
In order to keep our internal repository consistent, we convert dependencies specified in POM files to Ivy. Unfortunately, Gradle does not provide an out-of-the-box solution for converting a Maven POM file to an Ivy file. Ant, on the other hand, does provide a task for this operation, and Gradle's own extensive API includes support for publishing dependencies to an Ivy repository. Leveraging these, we've created a custom Gradle plugin to convert Maven dependencies to Ivy.
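To illustrate the heart of such a conversion, the sketch below maps Maven scopes onto Ivy configurations and emits Ivy dependency entries. The configuration mappings follow common Ivy conventions; LinkedIn's actual custom configurations are not shown here.

```kotlin
// Sketch of a POM-to-Ivy scope mapping. The configuration mappings follow
// common Ivy conventions; LinkedIn's custom configurations may differ.
val scopeToConf = mapOf(
    "compile"  to "compile->default",
    "provided" to "provided->default",
    "runtime"  to "runtime->default",
    "test"     to "test->default"
)

fun ivyDependencyEntry(org: String, name: String, rev: String, scope: String): String {
    val conf = scopeToConf[scope] ?: "compile->default"
    return """<dependency org="$org" name="$name" rev="$rev" conf="$conf"/>"""
}

fun main() {
    // A compile-scoped Maven dependency becomes an Ivy <dependency> element
    println(ivyDependencyEntry("com.google.guava", "guava", "19.0", "compile"))
}
```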
Summary
In a large development organization with as many engineering teams as LinkedIn has, it is inevitable that teams will want to leverage some of the fantastic software assets that exist in the open source community. At the same time, it is essential that we protect the teams and their work from the various risks that come with using unvetted software. The External Library Request service does exactly that. Before this self-service option was in place, the process of importing libraries into LinkedIn was manual, time-consuming, and error-prone, leading to a loss of productivity. Deploying the ELR service has addressed that challenge for 95+% of ELR requests, and we continue to close the gap by identifying techniques to resolve the trickier requests. Finally, I would like to thank Darius Archer, Chong Wang, Jarek Rudzinski, Marius Seritan, and Omid Monshizadeh for their help and contributions to the ELR project at LinkedIn.