## Description
This issue is the landing page for all things compilation speed related. We define the most important usage scenarios, as seen by the @rust-lang/compiler team and the @rust-lang/wg-compiler-performance working group. Benchmarks based on these scenarios are then used as a metric for how well the compiler is doing, compile-time-wise. We then establish a model of the compilation process and use it to estimate the effect of optimization categories on the various usage scenarios. Finally, we list concrete optimization ideas and initiatives, relate them to categories and usage scenarios, and link to their various specific GH issues.
## Usage Scenarios
We track compiler performance in a number of use cases in order to gauge how compile times in the most important and common scenarios develop. We continuously measure how the compiler performs on perf.rust-lang.org. #48750 defines which concrete benchmarks go into measuring each usage scenario. In the following, "project" means a medium to large codebase consisting of dozens of crates, most of which come from crates.io.
- **FROM-SCRATCH** - Compiling a project from scratch (P-high)
  This scenario occurs when a project is compiled for the first time, during CI builds, and when compiling after running `cargo clean`, `make clean`, or something similar. The targeted runtime performance for builds like this is usually "fast enough", that is, the compiled program should execute with performance that does not hinder testing, but it is not expected to have the absolute peak performance you would expect from a release build.
- **SMALL-CHANGE** - Re-compiling a project after a small change (P-high)
  This scenario is the most common during the regular edit-compile-debug cycle. Low compile times are the top priority in this scenario. The compiled program's runtime performance, again, only needs to be good enough not to be an obstacle.
- **RLS** - Continuously re-compiling a project for the Rust Language Server (P-high)
  This scenario covers how the Rust compiler is used by the RLS. Again, low compile times are the top priority here. The only difference from the previous scenario is that no executable program needs to be generated at all. The output here is diagnostics and the RLS-specific `save-analysis` data.
- **DIST** - Compiling a project for maximum runtime performance (P-medium)
  This scenario covers the case of generating release artifacts meant to be distributed and to reflect runtime performance as perceived by end users. Here, compile times are of secondary importance -- they should be low if possible (especially for running benchmarks) but if there is a trade-off between compile time and runtime performance, then runtime performance wins.
Sometimes we also see people "measuring" Rust's compile times by compiling a "Hello World" program and comparing how long that takes with other languages. While we do have such a benchmark in our suite, we don't consider it one of the important usage scenarios.
## A Model of the Compilation Process
The compiler does lots of different things while it is compiling a program. Making any one of those things faster might improve the compile time in one of the scenarios above or it might not. This section will establish a few categories of tasks that the compiler executes and then will map each category to the scenarios that it impacts.
### Compilation Phases / Task Categories
The compiler works in four large phases these days:
- **Pre-Query Analysis (pre-query)** -- This roughly includes parsing, macro expansion, name resolution, and lowering to HIR. The tasks in this phase still follow the bulk processing paradigm, and the results they produce cannot be cached for incremental compilation. This phase is executed on a single thread.
- **Query-based Analysis (query)** -- This includes type checking and inference, lowering to MIR, borrow checking, MIR optimization, and translation to LLVM IR. The various sub-tasks in this phase are implemented as so-called queries: computations that form a DAG and whose results can be tracked and cached for incremental compilation (see the sketch after this list). Queries are also only executed if their result is requested, so in theory this system would allow for partially compiling code. This phase, too, is executed on a single thread.
- **LLVM-based optimization and code generation (llvm)** -- Once the compiler has generated the LLVM IR representation of a given crate, it lets LLVM optimize it and then translate it to a number of object files. For optimized builds this usually takes up 65-80% of the overall compilation time. This phase can be run on multiple threads in parallel, depending on compiler settings. Incremental compilation allows skipping LLVM work, but it is less effective here than for queries because the caching is more coarse-grained.
- **Linking (link)** -- Finally, after LLVM has translated the program into object files, the output is linked into the final binary. This is done by an external linker which `rustc` takes care of invoking.
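To make the query concept more concrete, here is a minimal, hypothetical sketch of demand-driven, memoized computation. All names and types here are invented for illustration and do not reflect rustc's actual query API:

```rust
use std::collections::HashMap;

// Sketch: each query result is computed on demand and memoized, so asking
// the same question twice is free. rustc's real query system additionally
// records the dependency edges between queries, forming the DAG that
// incremental compilation uses to reuse results across sessions.
struct QueryContext {
    type_of_cache: HashMap<u32, String>, // key: item id, value: its (fake) type
}

impl QueryContext {
    fn type_of(&mut self, item_id: u32) -> String {
        if let Some(cached) = self.type_of_cache.get(&item_id) {
            return cached.clone(); // cache hit: no recomputation
        }
        // Cache miss: run the actual computation (stubbed out here) and
        // remember the result for later callers.
        let result = format!("TypeOf(item {})", item_id);
        self.type_of_cache.insert(item_id, result.clone());
        result
    }
}

fn main() {
    let mut cx = QueryContext { type_of_cache: HashMap::new() };
    let first = cx.type_of(42); // computed
    let second = cx.type_of(42); // served from the cache
    assert_eq!(first, second);
}
```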
Note that this describes the compilation process for a single crate. However, in the scenarios given above, we always deal with a whole graph of crates. Cargo coordinates the build process for such a graph, only re-compiling crates whose code has changed. For the overall compile time of a whole project, it is important to note that Cargo will compile multiple crates in parallel, but it can only start compiling a crate once all of its dependencies have been compiled. For example, if crate `a` depends on crates `b` and `c`, Cargo can compile `b` and `c` in parallel, but `a` has to wait for both. The crates in a project form a directed acyclic graph.
### Using Task Categories To Estimate Optimization Impact On Scenarios
From the description above we can infer which optimizations will affect which scenarios:
- Making a (pre-query) task faster will affect all scenarios as these are unconditionally executed in every compilation session.
- Making a (query) task faster will also affect all scenarios as these are also executed in every session.
- Making (llvm) execute faster will have most effect on FROM-SCRATCH and DIST, and some effect on SMALL-CHANGE. It will have no effect on RLS.
- Making (link) faster helps with FROM-SCRATCH, DIST, and SMALL-CHANGE, since we always have to link the whole program in these cases. In the SMALL-CHANGE scenario, linking will be a bigger portion of overall compile time than in the other two. For RLS we don't do any linking.
- Turning a (pre-query) task into a (query) task will improve SMALL-CHANGE and RLS because we profit from incremental compilation in these scenarios. If done properly, it should not make a difference for the other scenarios.
- Reducing the amount of work we generate for (llvm) will have the same effects as making (llvm) execute more quickly.
- Reducing the time between the start of compiling a crate and the point where dependents of that crate can start compiling can bring superlinear compile-time speedups because it reduces contention in Cargo's parallel compilation flow.
- Reducing the overhead for incremental compilation helps with SMALL-CHANGE and RLS and possibly FROM-SCRATCH.
- Improving incr. comp. caching efficiency for LLVM artifacts helps with SMALL-CHANGE and possibly FROM-SCRATCH, but not DIST, which does not use incremental compilation, and RLS, which does not produce LLVM artifacts.
- Improving the generation of `save-analysis` data will only help the RLS scenario, since this kind of data is not produced in any of the other scenarios.
## Concrete Performance Improvement Initiatives
There are a number of concrete initiatives, large and small, that strive to improve the compiler's performance. Some of them are far along; others are just ideas that still need to be validated before being pursued further.
### Incremental compilation
Work on supporting incremental re-compilation of programs has been ongoing for quite a while now and it is available on stable Rust. However, there are still many opportunities for improving it.
- Status: "version 1.0" available on stable.
- Current progress is tracked in Tracking Issue for Incremental Compilation #47660.
- Affected usage scenarios: SMALL-CHANGE, RLS
### Query parallelization
Currently, the compiler can evaluate queries (which comprise a large part of the non-LLVM compilation phases) only in a single-threaded fashion. However, since queries have a clear evaluation model that structures computations into a directed acyclic graph, it seems feasible to implement parallel query evaluation at the framework level. @Zoxc has even done a proof-of-concept implementation. This would potentially help with all usage scenarios, since all of them have to execute the query part of compilation.
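As a rough, invented sketch of the idea (not the actual implementation): queries with no dependency edge between them could run on separate worker threads, while a dependent query joins their results:

```rust
use std::thread;

// Hypothetical sketch: two queries that do not depend on each other are
// evaluated on separate threads; a query that depends on both blocks until
// they are done. The real implementation additionally has to synchronize
// the shared query cache and handle query cycles.
fn main() {
    let type_check = thread::spawn(|| "type-check result for item A");
    let build_mir = thread::spawn(|| "MIR for item B");

    // A downstream query waits for the queries it depends on.
    let a = type_check.join().unwrap();
    let b = build_mir.join().unwrap();
    println!("dependent query sees: {}, {}", a, b);
}
```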
- Status: Preliminary work in progress, experimental
- Current progress is tracked in Query Parallelization Tracking Issue #48685
- Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, RLS, DIST
### MIR-only RLIBs
"MIR-only RLIBs" is what we call the idea of not generating any LLVM IR or machine code for RLIBs. Instead, all of this would be deferred to when executables, dynamic libraries, or static C libraries are generated. This potentially reduces the overall workload for compiling a whole crate graph and has some non-performance related benefits too. However, it might be detrimental in some other usage scenarios, especially if incremental compilation is not enabled.
- Status: Blocked on query parallelization
- Current progress is tracked in Tracking issue for MIR-only RLIBs #38913
- Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, DIST
### ThinLTO
ThinLTO is an LLVM mode that allows whole-program optimization to be performed in a mostly parallel fashion. It is currently supported by the Rust compiler and even enabled by default in some cases. It tries to reduce compile times by distributing the LLVM workload across more CPU cores. At the same time, the overall amount of work increases, so it can also have detrimental effects.
- Status: available on stable, default for optimized builds
- Current progress is tracked in Tracking issue for enabling multiple CGUs in release mode by default #45320
- Affected usage scenarios: FROM-SCRATCH, DIST
### Sharing generic code between crates
Currently, the compiler will duplicate the machine code for generic functions within each crate that uses a specific monomorphization of a given function. If there is a lot of overlap between crates, this potentially means lots of redundant work. It should be investigated how much work and compile time could be saved by re-using monomorphizations across crate boundaries.
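A small illustration of the duplication (ordinary Rust, compilable on its own):

```rust
// A generic function is compiled separately for each concrete type it is
// instantiated with (monomorphization).
pub fn largest<T: PartialOrd + Copy>(items: &[T]) -> T {
    let mut max = items[0];
    for &item in &items[1..] {
        if item > max {
            max = item;
        }
    }
    max
}

fn main() {
    // This crate gets machine code for largest::<u32> and largest::<f64>.
    // If another crate in the same crate graph also calls largest::<u32>,
    // it currently gets its own, identical copy of that machine code;
    // cross-crate sharing would compile each instantiation only once.
    println!("{}", largest(&[1u32, 5, 3]));
    println!("{}", largest(&[1.5f64, 0.5, 2.5]));
}
```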
- Status: Experimental implementation in Allow for re-using monomorphizations in upstream crates. #48779
- Current progress is tracked in Experiment with sharing monomorphized code between crates #47317
- Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, DIST
### Sharing closures among generic instances
We duplicate the code for closures within generic functions even if the closures do not depend on the generic parameters of the enclosing function. This leads to redundant work. We should try to be smarter about it.
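A small example of such a closure (names and types are only for illustration):

```rust
use std::fmt::Debug;

fn process<T: Debug>(items: &[T]) {
    // This closure's code depends only on `u32`, never on `T`, yet the
    // compiler currently emits a separate copy of it for every type that
    // `process` is instantiated with.
    let double = |x: u32| x * 2;
    for (i, item) in items.iter().enumerate() {
        println!("{}: {:?}", double(i as u32), item);
    }
}

fn main() {
    process(&[1, 2, 3]);       // instantiates process::<i32> plus its closure
    process(&["a", "b", "c"]); // instantiates process::<&str> plus another,
                               // identical copy of the same closure
}
```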
- Status: Unknown
- Current progress is tracked in Instantiate fewer copies of a closure inside a generic function #46477
- Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, DIST
### Perform inlining at the MIR level
Performing at least some amount of inlining at the MIR level would potentially reduce the pressure on LLVM. It would also reduce the overall amount of work to be done, because this inlining would only have to be done once for all monomorphizations of a function, while LLVM has to redo the work for each instance. There is an experimental implementation of MIR inlining, but it is not production-ready yet.
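For illustration, consider this invented example; `square` is a natural inlining candidate that today gets re-inlined by LLVM for every monomorphization of its caller:

```rust
use std::ops::{Add, Mul};

// If `square` were inlined into `sum_of_squares` while both are still
// generic MIR, the inlining would happen once. Today, the generic code is
// monomorphized first, so LLVM repeats the inlining work separately for
// sum_of_squares::<i32>, sum_of_squares::<f64>, and every other instance.
fn square<T: Mul<Output = T> + Copy>(x: T) -> T {
    x * x
}

fn sum_of_squares<T: Add<Output = T> + Mul<Output = T> + Copy>(a: T, b: T) -> T {
    square(a) + square(b)
}

fn main() {
    println!("{}", sum_of_squares(3i32, 4i32));
    println!("{}", sum_of_squares(1.5f64, 2.0f64));
}
```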
- Status: Experimental implementation exists on nightly, not stable or optimized yet
- Current progress is tracked in change how MIR inlining handles cycles #43542 (kind of)
- Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, DIST
### Provide tools and instructions for profiling the compiler
Profiling is an indispensable part of performance optimization. We should make it as easy as possible to profile the compiler and get an idea of what it is spending its time on. That includes guides on how to use external profiling tools as well as improvements to the compiler-internal profiling facilities.
- Status: Idea
- Current progress is tracked in (nowhere yet)
- Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, RLS, DIST
### Build released compiler binaries as optimized as possible
There is still headroom for turning on more optimizations when building the `rustc` release artifacts. Right now this is blocked by a mix of CI restrictions and possibly outdated constraints on how Rust dylibs are built.
- Status: In progress
- Current progress is tracked in Build released compiler artifacts as optimized as possible #49180
- Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, RLS, DIST
Feel free to leave comments below if there's anything you think is pertinent to this topic!