Class-based dependency tracking in name hashing algorithm #1104
Issue sbt/sbt#1104 discusses in depth what's needed to implement tracking of dependencies at the class level.
I'd like to share notes on issues and design choices I encountered while prototyping phase 1.

Referring to (naming) classes

The existing incremental compiler uses class names to refer to targets of external dependencies. External dependencies are the ones where the target of a dependency is in a source file outside of a given project. In that case the dependency is expressed as src file -> class name. The name of a class is stored as a string. The string is created from a symbol seen after the flatten phase and it includes its package and its owners. For example, if we have:

```scala
package x.y
class A {
  object B {
    class C
  }
}
```

then the recorded names include the package prefix and all owners (so the name for `C` mentions `x.y`, `A` and `B`).

I was wondering if the lack of distinction between companions would lead to a problem of two distinct classes clashing with each other:

```scala
class A {
  class X
}
object A {
  class X
}
```

but this is disallowed by the Scala compiler. In my old prototype, I used names of symbols as seen by the typer phase (before flattening) and I tracked all classes declared in a given source file in a dedicated relation. I can't see why this couldn't work at the moment, so I'll try sticking to the existing scheme.

Preserving the old incremental compilation algorithm

I looked into preserving the old incremental compilation algorithm. It's a lot of work because relations in the old algorithm are expected to be mostly of type `Relation[File, File]`.

Switching to class-based invalidation

Switching to class-based invalidation is not enough to fix #2320. In addition to tracking dependencies at the class level we also need to know which classes have changed. The full implementation of that tracking is the topic of phase 2. In phase 1, I'll just add separate tracking of hashes for each top-level class defined in a source file. This is the minimal change needed to fix #2320 and prove that phase 1 is mostly complete.
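As a minimal illustration of the shape of the data involved in external dependency tracking (the type and method names here are hypothetical stand-ins, not sbt's actual API), the src file -> class name relation can be modelled like this:

```scala
import java.io.File

// External dependencies: a source file in this project depends on a class
// (referred to by its name string) that lives outside the project.
final case class ExternalDeps(deps: Map[File, Set[String]]) {
  def add(src: File, className: String): ExternalDeps =
    ExternalDeps(deps.updated(src, deps.getOrElse(src, Set.empty[String]) + className))

  def dependenciesOf(src: File): Set[String] =
    deps.getOrElse(src, Set.empty[String])
}

// Usage: record that Foo.scala depends on the external class named "x.y.A"
val deps = ExternalDeps(Map.empty).add(new File("Foo.scala"), "x.y.A")
```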
I'm adding more notes taken while prototyping class-based dependency tracking.

Naming classes

I discussed how to refer to (name) classes in the comment above. I was wondering if naming classes as seen after the flatten phase is ok. However, the API phase that extracts class structures (and their names) uses names as seen before the pickler phase (and the flatten phase), so we run into an inconsistency. I think either choice (pickler or after flatten) is fine, but we must be consistent throughout the entire incremental compiler algorithm. I'm leaning towards sticking to names of symbols at the pickler phase, to be consistent with the rest of the API extraction logic that sees trees before the pickler phase. I'll try to switch everything in the incremental compiler to names seen at the pickler phase and report back.

Local and anonymous classes

Consider the example:

```scala
class A
class A2
class B {
  def foo = {
    class Foo(a: A)
  }
  {
    class Bar extends A2
  }
  class Inner(a: A)
}
```

I think what we should do is to attribute the dependencies of the local classes `Foo` and `Bar` to their enclosing non-local class, `B`. My example above was a bit simplistic because the local classes were defined within a top-level class. Let's look at what we do in the general case of arbitrary nesting. Given the symbol of a local class, we walk the owner chain outwards until we reach a class or object that is not nested (directly or transitively) inside a term, i.e. a method body or a block. The intuition for this algorithm is that we want to attribute dependencies to the innermost class or object that we can make a full name for. I haven't done any work to implement it. I'll add a task to the list in this ticket.
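The owner-chain walk described above can be sketched roughly as follows. This is a simplified toy model, not the actual compiler API: `Node`, `ClassNode` and `TermNode` are hypothetical stand-ins for compiler symbols and their owners.

```scala
// A toy model of an owner chain: a definition is either a class/object or a
// term owner (a method body or a block). The real implementation would walk
// compiler Symbols instead.
sealed trait Node { def owner: Option[Node] }
final case class ClassNode(name: String, owner: Option[Node]) extends Node
final case class TermNode(owner: Option[Node]) extends Node

// Attribute dependencies of a (possibly local) class to the innermost
// enclosing class/object whose owner chain contains no terms, i.e. the
// innermost one we can form a stable full name for.
def attributionTarget(c: ClassNode): ClassNode = {
  def owners(n: Node): List[Node] =
    n.owner.toList.flatMap(o => o :: owners(o))
  def nonLocal(k: ClassNode): Boolean =
    owners(k).forall(_.isInstanceOf[ClassNode])
  (c :: owners(c).collect { case k: ClassNode => k })
    .find(nonLocal)
    .getOrElse(c) // a top-level class has no owners, so find always succeeds
}
```

For `Foo` in the example above (owners: the body of `foo`, then `B`), this walk skips the term owner and returns `B`.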
This commit implements invalidation of dependencies at the class level, both for inheritance and memberRef dependencies. However, the information that determines api changes is still tracked mostly at the source file level. Name hashes are tracked at the source level (classes considered for invalidation are mapped back to source files before the name hash lookup). For that reason, sbt#2319 is still not fixed. Api hashes are tracked per class, but only for top-level classes. It's not a final design but a quick hack to prove that class-based invalidation is working correctly. It was enough to get the scenario described in sbt#2320 fixed. This is shown in the `class-based-inheritance` scripted test. The final design of api hashing will be that all publicly accessible classes get their own separate hash. The sealed test is marked pending as described in the notes of sbt#1104. Another thing to note is that we're getting close to being able to strip the abstractions that deal with the fact that dependencies were tracked either as Relation[File, File] (for internal dependencies) or Relation[File, String] (for external dependencies). We track all dependencies uniformly as Relation[String, String] now.
I've done a bit more research on naming classes. Here are my notes.

Naming classes and class files

In the implementation of the incremental compiler, the recording of produced class files, their class names and binary dependencies are all mixed together. The incremental compiler needs a function from a produced class file to the source file it was compiled from. It also needs a function from a class name to the source file declaring that class. For both of those functions, the class name was a name seen after the flatten phase. It's important that the name is consistent with the name of the class file corresponding to it, because conversion between names and class files is done in several places. Once we have a srcFile we can grab a corresponding api object and figure out whether external (from a different project) changes affect us.

Let's see how we can solve this. My current thinking is that I'll stick to flat names for everything binary-related (class files, binary dependencies, classpath lookups, etc.) and I'll stick to pickler names for everything we're interested in reasoning about using the Scala model (declared classes, source dependencies). For the external dependencies (from other projects) we need a mapping from binaryClassName (after the flatten phase) to srcClassName (at the pickler phase). We'll need a new relation for that mapping. This change might be tricky enough to implement that I'm creating a new task for it. My guess is that I'll have to do some work to introduce a clear distinction between binary and src names for classes.
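The two-namespace scheme described above can be sketched as a simple bidirectional lookup. This is an illustrative model only (the type and field names are assumptions, not sbt's API); the example name strings are likewise just illustrations of the two styles of names.

```scala
// Two names for the same class:
//  - a source name, as seen at the pickler phase, e.g. "x.y.A.B.C"
//  - a binary name, as seen after flatten / in class file names, e.g. "x.y.A$B$C"
final case class ClassNames(srcToBinary: Map[String, Set[String]]) {
  // one source class may correspond to several class files
  def binaryNames(srcClassName: String): Set[String] =
    srcToBinary.getOrElse(srcClassName, Set.empty)

  // reverse lookup: map a binary dependency back to a source class name
  def srcName(binaryClassName: String): Option[String] =
    srcToBinary.collectFirst {
      case (src, bins) if bins(binaryClassName) => src
    }
}
```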
More notes on names.

Distinction of objects and classes in names

In my original comment on naming classes I said there would be no distinction between names of objects and classes. If you have a pair of companions:

```scala
package abc
class A
object A
```

then there's just one name, `abc.A`, for both of them. Couldn't we just append a suffix to object names to tell them apart? A test I wrote settles the question whether we should distinguish between dependencies on objects and classes. The answer is no.
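Since a class and its companion object end up under a single name, their extracted APIs have to be considered together. A minimal sketch of that pairing (the `ClassApi`, `Companions` and `pairCompanions` names are illustrative stand-ins, not sbt's API):

```scala
// A companion class and object share one name in the incremental compiler,
// so their extracted APIs are stored as a pair.
final case class ClassApi(apiHash: Int)
final case class Companions(classApi: Option[ClassApi], objectApi: Option[ClassApi])

def pairCompanions(
    name: String,
    classApis: Map[String, ClassApi],   // extracted separately for classes
    objectApis: Map[String, ClassApi]   // ... and for objects
): Companions =
  Companions(classApis.get(name), objectApis.get(name))
```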
Ah, that makes sense. Very interesting, thanks.
I've spent some time thinking about and prototyping the handling of top-level imports. This is one of the tasks outlined in the description of the ticket (for phase 1).

Handling of top-level imports

Let's start with an example motivating the problem:

```scala
// A.scala
package foo
class A
```

```scala
// B.scala
import foo.A
class B
```

At the moment, the dependency introduced by the import node has no class it can naturally be attributed to, so tracking it precisely would require dedicated machinery. And all of that effort would go into supporting unused imports. If the import is used, the dependency will be accounted for by the class that is using the imported type or term. I think the return we get on the investment is not great.

What can we do? I think we could cheat a little bit and pretend that top-level imports are actually imports defined inside a top-level class. Our example would be treated as if it were:

```scala
// A.scala
package foo
class A
```

```scala
// B.scala
class B {
  import foo.A
}
```

and this way we would just have an ordinary dependency of `B` on `A`. Note, however, that the two programs are not equivalent, because the import is scoped differently:

```scala
// A.scala
package foo
class A
```

```scala
// B.scala
// we don't see imported A here
class B(a: A) {
  import foo.A
}
```

That's why it's cheating. However, it doesn't matter for dependency tracking. It's enough that we record the dependency anywhere in the file, because any recorded dependency will cause the invalidation we need when `A` changes.

My proposal is that we collect all dependencies introduced by top-level imports and assign them to the first top-level class/object/trait declared in a source file. What if the source file contains just top-level imports and no class/object/trait declarations? We issue a warning that the incremental compiler doesn't track dependencies correctly in such a case. This solution will be straightforward to implement (it's a simple tweak to the dependency extraction logic) and will handle the majority of cases where you have both unused imports and classes declared in a file. If this turns out to be problematic, a more principled (and much more complex to implement) solution can be introduced later on.
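The proposed attribution rule can be sketched in a few lines. This is an illustrative sketch (the `UnitInfo` type and the warning text are assumptions), not the actual dependency extraction code:

```scala
// Attribute dependencies coming from top-level imports to the first top-level
// class/object/trait in the compilation unit; warn when there is none.
final case class UnitInfo(topLevelClasses: List[String], importDeps: Set[String])

def attributeImportDeps(unit: UnitInfo, warn: String => Unit): Map[String, Set[String]] =
  unit.topLevelClasses match {
    case first :: _ =>
      Map(first -> unit.importDeps)
    case Nil =>
      if (unit.importDeps.nonEmpty)
        warn("top-level imports in a unit with no top-level class; their dependencies are not tracked")
      Map.empty
  }
```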
The `binaryClassName` relation maintains a mapping between source and binary class names. This mapping is needed to map binary dependencies back to source dependencies in the case of separate compilation (where we see dependencies on class files). You can see that mapping being used in the `binaryDependency` method implementation of the Analysis callback. Previously, we would map a class file to the source file it was produced from and then assume that the dependency is on any (all) of the classes declared in that source file. The introduction of `binaryClassName` lets us map a dependency back to a source class name directly and removes that imprecision of dependency tracking. We maintain the mapping between source and binary class names just for non-local classes. Check sbt#1104 (comment) for the discussion of local and non-local classes. We also rework the tracking of products in Analysis by explicitly introducing the concept of local and non-local products, corresponding to local and non-local classes. This helps clarify for which classes we track source and binary class names.
It's a good idea to cheat a bit for import tracking, I agree. I wonder about putting them in some existing top-level class, though. Maybe we should have a synthetic class representing the compilation unit instead?
I thought about introducing a synthetic class. However, I suspect that the implementation would run into subtle issues. For example, the current implementation assumes that there's a 1-n relation between a class declared in a source file and class files. I'd have to chase the places where it matters and patch them. On the other hand, the proposal I described above was a simple patch to the dependency extraction logic.
Ok, I was thinking it could be confusing to use an arbitrary class (from potentially many) in the compilation unit as the one that has the "compilation unit stuff".
Implements a strategy for recording dependencies introduced by top-level imports by assigning those dependencies to the first top-level class. In case there are top-level imports but no top-level class/trait/object defined in a compilation unit, a warning is issued. The rationale for this strategy can be found at: sbt#1104 (comment) Add a unit test covering different cases of top-level imports (e.g. defined in nested packages). Mark the scripted test source-dependencies/import-class as passing after a small modification of adding a top-level class.
Add a pending scripted test that checks if dependencies introduced through inheritance by local classes are handled properly. Check the comment in the test and this discussion: sbt#1104 (comment) for more details.
Dependencies introduced by local classes require special handling because we track only non-local classes (those that can be referred to from other files) in dependency relations. To overcome this problem, dependencies introduced by a local class are recorded as introduced by an outer class that is non-local. However, this introduces a new problem with dependencies introduced by inheritance. We don't want local inheritance dependencies to cause a transitive invalidation of all classes that inherit from the outer class containing the local class. Check the comment in Relations.scala this patch introduces, or follow the discussion of this problem at: sbt#1104 (comment) To capture the subtlety of inheritance dependencies from local classes, we introduce a `LocalDependencyByInheritance` case to the `DependencyContext` enum. TestCallback has been modified to return extracted local inheritance dependencies, and a test in DependencySpecification has been updated accordingly. The Dependency phase has been reworked to handle local classes properly by mapping dependencies to outer, non-local classes. Check the implementation for details of the mapping. It's worth mentioning that the mapping is implemented as an amortized O(1) lookup, so this change doesn't affect the performance of the extraction phase negatively. The invalidation logic has been modified to take into account inheritance dependencies introduced by local classes. The patch is small because the invalidation logic is very straightforward: we invalidate local inheritance dependencies non-transitively and we do not apply name hashing dependency pruning. Lastly, we mark the local-class-inheritance scripted test as passing.
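The invalidation rule from the commit message above (transitive for regular inheritance, a single non-transitive step for local inheritance, with no name-hashing pruning) can be sketched like this. The function and parameter names are illustrative, not the actual invalidation code:

```scala
// Regular inheritance dependencies invalidate transitively; local inheritance
// dependencies invalidate one step only, so a local class inheriting from X
// does not drag every subclass of its outer class into the invalidated set.
def invalidateByInheritance(
    modified: Set[String],
    inheritance: Map[String, Set[String]],      // parent -> subclasses
    localInheritance: Map[String, Set[String]]  // parent -> outer classes of local subclasses
): Set[String] = {
  @annotation.tailrec
  def transitive(seed: Set[String]): Set[String] = {
    val next = seed ++ seed.flatMap(inheritance.getOrElse(_, Set.empty))
    if (next == seed) seed else transitive(next)
  }
  // one non-transitive step for local inheritance, no name-hashing pruning
  transitive(modified) ++ modified.flatMap(localInheritance.getOrElse(_, Set.empty))
}
```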
I have great news! Phase 1 of the class-based dependency tracking is implemented. All tasks are marked as completed. I pushed all of my changes and they are available for viewing here: https://github.com/sbt/sbt/compare/0.13...gkossakowski:class-based-dependencies?expand=1 I'm going to test the changes from phase 1 on a few real projects like scala/scala, scalatest and specs2 to catch some bugs. In the next few days, I'll add more notes on issues encountered while implementing the changes from phase 1.
Cool! Could you provide a quick update?
I'm adding a note on an implementation challenge around the notion of local classes.

Local classes (implementation challenge)

The notion of local classes has been discussed before in this ticket. I'm going to discuss an issue with implementing that notion that makes the code of the incremental compiler more complicated, because two different phases have to communicate with each other.

Whether a class is local or not can be determined at phases after typechecking and before lambdalift, which moves classes around and destroys the original nesting structure. We need to know whether a class is local or non-local for another reason, too: to find out how its class files should be recorded. We have a notion of both local and non-local products. The information about produced class files (e.g. their names) is only available at the end of the compilation pipeline, where it's extracted by the Analyze phase. So localness has to be computed early (while the original owner chains are still intact) and stashed in a map that later phases consult.

This solution works but is ugly. Now we have to worry about the lifecycle of the map and make sure that it's properly populated so it can serve both the Dependency and Analyze phases. It's ugly for another reason: there's already a similar map maintained for the backend so it can generate the proper inner-class metadata in the bytecode.

I decided to add the note on this implementation detail to show that destructive changes like rewriting owner chains can be problematic in surprising contexts. It's good to keep this in mind when new design ideas are floated around (e.g. in Dotty).
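The cross-phase communication described above can be sketched as a small shared registry. This is an illustrative model only (the class and method names are assumptions): the Dependency phase, which still sees original owner chains, records which classes are local, and the Analyze phase, running after lambdalift, consults that information when recording produced class files as local or non-local products.

```scala
// A mutable registry shared between the two phases. Keyed by class name here;
// the real implementation might key by symbol or class file instead.
final class LocalClassRegistry {
  private val local = scala.collection.mutable.Set.empty[String]

  // called from the early phase, where nesting is still intact
  def markLocal(className: String): Unit = local += className

  // called from the late phase, after lambdalift has flattened nesting
  def isLocal(className: String): Boolean = local(className)
}
```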
This commit changes how tracking of API data structures is done within the incremental compiler. It changes how APIs are passed around and stored, but doesn't change the behavior of the incremental compiler. Here's a summary of what has changed and what's still being tracked at the source file level:

- APIs are tracked per class name in a newly introduced Companions data structure; the incremental compiler always considers the pair of companions from now on
- only APIs for top-level classes are extracted at the moment; invalidation is still imprecise
- used names are tracked per source file
- name hashes are tracked per top-level class (they're part of the AnalyzedClass data structure)

A companion class and object have to be considered as a pair because they're given exactly the same name in the incremental compiler. The idea of naming classes and objects separately has been discussed and rejected here: sbt#1104 (comment) APIs of companions are linked together in AnalysisCallback. The ExtractAPI compiler phase continues to extract apis of classes and objects separately. More on those changes below.

Most changes in this patch are direct consequences of the changes in the `interface/other` file. The `Source` has been replaced by `AnalyzedClass`. The AnalyzedClass carries an extracted api of the (class, companion object) pair (stored as `Companions`) plus some metadata about the pair (e.g. the hash sum of its api representation). Source used to carry both a hash sum of the source file contents (a hash of the text) and a hash of the api. The hash of the source file had been introduced to shortcut equality checking before api hashing was introduced. Now it's redundant, so it's removed. The `SourceAPI` used to carry information about packages declared in a source file, but this information wasn't used in the incremental compiler, so its tracking is removed. I also removed the ugly-looking `_internalOnly_` prefixes.
The changes in this branch are not binary or source compatible, so it's a good opportunity to perform some cleanups. AnalysisCallback has a new method `startSource`. It's needed because we want to track sources that do not declare any classes and for which there's no `api` call. The `api` method takes `ClassLike` as an argument and can be called multiple times per source file.

The implementation of AnalysisCallback has been affected by the switch from Source to AnalyzedClass tracking the most. The main change here is that AnalysisCallback is responsible for pairing the apis of a companion class and a companion object. It stores apis of classes and objects separately and does the same for their name hashes. Right before persisting all that information, it puts both apis into the Companions data structure and merges the name hashes for a class and its companion object. Merging is performed only when both a class and an object with the same name are declared. It's worth noting why we cannot just compute name hashes once we have a class and its companion object available, instead of merging precomputed name hashes. We can't do that because APIs are minimized and only their hashes are stored, so names and signatures of members are lost at that point. We have to compute name hashes before minimization, but then we don't have both companions available yet. For that reason the `NameHashing.merge` operation has been introduced that performs a straightforward merge. NameHashingSpecification provides a basic test for its properties.

The incremental invalidation algorithm has been affected by the switch to class-level api tracking. As a consequence, most of the invalidation is happening at the class level now, and only right before compilation are class names mapped to the source files that declare them. To emphasize that point, recompilation of classes has been refactored into its own method `recompileClasses` in IncrementalCommon.
However, we had to make two exceptions to invalidation performed at the class level:

1. Sources that have just been added: we don't know their declared classes yet, so we have to schedule them for compilation as-is.
2. Sources that do not declare any classes have to be scheduled for recompilation directly, too.

This is the reason why `cycle` takes both invalidated classes and modified srcs as inputs, and why `invalidateInitial` computes both. After the first iteration of `cycle`, the set of modified sources becomes empty and the rest of the invalidation is performed at the class level only.

Here's a list of changes I think are worth highlighting, either for clarity or to make a point:

- SameAPI dropped some old, unused code from the TopLevel and NameChanges classes
- APIs do not have any reference to `java.io.File` now; this data structure operates purely on class names
- helper methods for looking up dependency information from Relations have been changed to work at the class level
- the version number in TextAnalysisFormat has been bumped
- the `inherited_type_params` scripted test has been removed, as it does not appear to test anything useful and breaks due to changes to the `APIs` interface
- Analyze doesn't store empty apis for Java source files that do not declare any classes; we use `AnalysisCallback.startSource` for tracking
- the Scala 2.8-specific test has been dropped
- the test for Ant-style compilation is marked as pending; supporting Ant-style compilation is tricky because invalidation is happening at the class level now
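The name-hash merging described in the commit message above can be sketched as a per-name combination of the two precomputed hash sets. This is a simplified model (the `NameHash` type is illustrative and `mix` is an arbitrary stand-in for the real hash combination):

```scala
// Per-member-name hashes of a companion class and object are merged after
// extraction, since the minimized APIs no longer carry member names.
final case class NameHash(name: String, hash: Int)

def merge(classHashes: Seq[NameHash], objectHashes: Seq[NameHash]): Seq[NameHash] = {
  def mix(a: Int, b: Int): Int = a * 31 + b
  (classHashes ++ objectHashes)
    .groupBy(_.name)                 // names present in both sides collapse
    .map { case (name, hs) => NameHash(name, hs.map(_.hash).reduce(mix)) }
    .toSeq
}
```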
I've done some preliminary testing of class-based dependency tracking to see how it fares.

Performance improvements

Scalatest (master): add a method to a class, then run an incremental compilation. (The timing table was not preserved here.)

Specs2 (master): add a method to a class, then run an incremental compilation. (The timing table was not preserved here.)
@gkossakowski great work! Would love to use it.
@gkossakowski: using https://github.com/gkossakowski/sbt/commits/class-based-dependencies and Scala 2.12 master (scala/scala@ad48e59 right now), I was able to reproduce most of your performance numbers, though not all of them.
I don't know if this is a regression or if something changed in Scala master since you did your tests (the biggest change is probably that the new trait encoding was merged: scala/scala#5003).
I did some very primitive benchmarking of non-incremental compilation. (The timing table was not preserved here.)
My first instinct was to blame the poor results in scala/scala on some weird pattern in the standard library.
This is necessary to get sbt#87 to work on top of sbt#86. Compared to sbt#86, this branch:

- Does not seem to impact the performance gains in incremental compilation
- Makes scala/scala non-incremental compilation 15% slower, but sbt#86 is 60% slower than 0.13.11 (see sbt/sbt#1104 (comment)), so that needs to be fixed first before we analyse this.

TODO:

- More tests similar to inv-subtyping
- Abstract types
- Singleton types
- Tests to verify whether we need to use `symbolsInType` or whether `typeSymbol` is enough
- ...
Thanks for testing it out! I'll check what's going on and report back.
Yes, dotty is pretty flat.
I reproduced the problem.
I debugged the problem.
I pushed fixes for local dependencies to my branch. As a work-around for sbt/zinc#88 I applied the patch included at the bottom. (The updated benchmark results were not preserved here.)
The work-around for sbt/zinc#88:

```diff
diff --git a/src/compiler/scala/tools/nsc/typechecker/Macros.scala b/src/compiler/scala/tools/nsc/typechecker/Macros.scala
index bcf9e01..e6b14bd 100644
--- a/src/compiler/scala/tools/nsc/typechecker/Macros.scala
+++ b/src/compiler/scala/tools/nsc/typechecker/Macros.scala
@@ -496,7 +496,7 @@ trait Macros extends MacroRuntimes with Traces with Helpers {
   /** Keeps track of macros in-flight.
    *  See more informations in comments to `openMacros` in `scala.reflect.macros.whitebox.Context`.
    */
-  var _openMacros = List[MacroContext]()
+  var _openMacros: List[MacroContext] = Nil
   def openMacros = _openMacros
   def pushMacroContext(c: MacroContext) = _openMacros ::= c
   def popMacroContext() = _openMacros = _openMacros.tail
```
This commit implements invalidation of dependencies at class level both for inheritance and memberRef dependencies. However, information that determines api changes is still tracked mostly at source file level. Name hashes are tracked at source level (classes considered for invalidation are mapped back to source files before name hash lookup). For that reason, sbt#2319 is still not fixed. Api hashes are tracked per-class but only for top-level classes. It's not a final design but a quick hack to prove that class-based invalidation is working correctly. It was enough to scenario described in sbt#2320 fixed. This is shown in `class-based-inheritance` scripted test. The final design of api hashing will be that all publicly accessible classes will get their own separate hash. The sealed test is marked pending as described in notes of sbt#1104. Other thing to note is that we're getting close to being able to strip abstractions that deal with the fact that dependencies were tracked either as Relation[File, File] (for internal dependencies) or Relation[File, String] (for external dependencies). We track all dependencies uniformly as Relation[String, String] now.
The `binaryClassName` relation maintains mapping between source and binary class names. This mapping is needed to map binary dependencies back to source dependencies in case of separate compilation (where we see dependencies on class files). You can see that mapping being used in `binaryDependency` method implementation of Analysis callback. Previously, we would map class file to a source file it was produced from and then assume that dependency is on any (all) of classes declared in that class. Introduction of `binaryClassName` lets us map dependency back to source class name directly and remove that imprecision of dependency tracking. We maintain mapping between source and binary class names just for non-local classes. Check this sbt#1104 (comment) for the discussion of local and non-local classes. We also rework tracking of products in Analysis by introducing explicitly the concept of local and non-local products corresponding to local and non-local classes. This helps us to clarify for which classes we track source and binary class names.
Implements a strategy for recording dependencies introduced by top level imports by assigning those dependencies to the first top level class. In case there are top level imports but no top level class/trait/object defined in a compilation unit, a warning is issued. The rationale for this strategy can be found at: sbt#1104 (comment) Add an unit test covering different cases of top level imports (e.g. defined in nested packages). Mark the scripted test source-dependencies/import-class as passing after a small modification of adding a top level class.
Add a pending scripted test that checks if dependencies introduced by inheritance by local classes are handled properly. Check the comment in the test and this discussion: sbt#1104 (comment) for more details.
Dependencies introduced by local classes require special handling because we track only non local classes (that can be referred from other files) in dependency relations. To overcome this problem, dependencies introduced by a local class are recorded as introduced by an outer class that is non local. However, this introduces a new problem with dependencies introduced by inheritance. We don't want local inheritance dependencies to cause a transitive invalidation of all classes that inherit from the outer class containing the local class. Check the comment in Relations.scala this patches introduces or follow the discussion of this problem at: sbt#1104 (comment) To capture the subtlety of inheritance dependencies from local classes, we introduce `LocalDependencyByInheritance` case to `DependencyContext` enum. TestCallback has been modified to return extracted local inheritance dependencies and a test in DependencySpecification has been updated accordingly. The Dependency phase has been reworked to handle local classes properly by mapping dependencies to outer, non local classes.Check the implementation for details of the mapping. It's worth mentioning that mapping is implemented as an amortized O(1) lookup so this change doesn't affect performance of extraction phase negatively. The invalidation logic has been modified to take into account inheritance dependencies introduced by local classes. The patch is small because the invalidation logic is very straightforward: we invalidate local inheritance dependencies non-transitively and we do not apply name hashing dependency pruning. Lastly, we mark local-class-inheritance scripted test as passing.
This commit changes how tracking of API data structures is done within the incremental compiler. It changes how APIs are passed around and stored but doesn't change the behavior of the incremental compiler. Here's a summary of what has changed and what's still being tracked at the source file level:

- APIs are tracked per class name in a newly introduced Companions data structure; the incremental compiler always considers the pair of companions from now on
- only APIs for top level classes are extracted at the moment; invalidation is still imprecise
- used names are tracked per source file
- name hashes are tracked per top-level class (they're part of the AnalyzedClass data structure)

A companion class and object have to be considered as a pair because they're given exactly the same name in the incremental compiler. The idea of naming classes and objects separately has been discussed and rejected here: sbt#1104 (comment) APIs of companions are linked together in AnalysisCallback. The ExtractAPI compiler phase continues to extract APIs of classes and objects separately. More on those changes below.

Most changes in this patch are direct consequences of the changes in the `interface/other` file. `Source` has been replaced by `AnalyzedClass`. AnalyzedClass carries the extracted API of the (class, companion object) pair (stored as `Companions`) plus some metadata about the pair (e.g. a hash sum of its API representation). Source used to carry both a hash of the source file contents (a hash of the text) and a hash of the API. The hash of the source file had been introduced to shortcut equality checking before API hashing was introduced; now it's redundant, so it's removed. `SourceAPI` used to carry information about packages declared in a source file, but this information wasn't used in the incremental compiler, so its tracking is removed. I also removed the ugly looking `_internalOnly_` prefixes.
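The data shapes described above can be summarized in a minimal sketch. Field names and types here are simplified guesses for illustration, not the exact zinc interface:

```scala
// Simplified stand-in for an extracted class/object API.
final case class ClassLikeApi(name: String, members: List[String])

// The pair of companions tracked under one shared class name.
final case class Companions(classApi: ClassLikeApi, objectApi: ClassLikeApi)

// Replaces the old per-source `Source` entry: everything is keyed by the
// (shared) class name instead of the source file, and the text hash is gone.
final case class AnalyzedClass(
  name: String,                 // shared name of the companion pair
  api: Companions,              // extracted API of (class, companion object)
  apiHash: Int,                 // hash of the minimized API representation
  nameHashes: Map[String, Int]  // per-member-name hashes used by name hashing
)
```

The point of the pairing is visible in the shape itself: a lookup by class name always yields both companions at once.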
The changes in this branch are not binary or source compatible, so it's a good opportunity to perform some cleanups.

AnalysisCallback has a new method, `startSource`. It's needed because we want to track sources that do not declare any classes and for which there's no `api` call. The `api` method takes a `ClassLike` as an argument and can be called multiple times per single source file.

The implementation of AnalysisCallback has been affected the most by the switch from Source to AnalyzedClass tracking. The main change here is that AnalysisCallback is responsible for pairing the APIs of a companion class and a companion object. It stores APIs of classes and objects separately and does the same for their name hashes. Right before persisting all that information, it puts both APIs into the Companions data structure and merges the name hashes of a class and its companion object. Merging is performed only when both a class and an object with the same name are declared.

It's worth noting why we cannot just compute name hashes once we have a class and its companion object available, instead of merging precomputed name hashes. We can't do that because APIs are minimized and only their hashes are stored, so the names and signatures of members are lost at that point. We have to compute name hashes before minimization, but then we don't have both companions available yet. For that reason the `NameHashing.merge` operation has been introduced, which performs a straightforward merge. NameHashingSpecification provides a basic test of its properties.

The incremental invalidation algorithm has been affected by the switch to class-level API tracking. As a consequence, most of the invalidation happens at the class level now, and only right before compilation are class names mapped to the source files that declare them. To emphasize that point, recompilation of classes has been refactored into its own method, `recompileClasses`, in IncrementalCommon.
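A merge of precomputed name hashes, as described above, might look like the following sketch. It assumes a name hash is just a (name, hash) pair and uses XOR as a stand-in combinator; the real zinc representation and combinator differ:

```scala
// Simplified (name, hash) pair standing in for zinc's name hash entries.
final case class NameHash(name: String, hash: Int)

object NameHashMerge {
  /** Merge the name hashes of a companion class and companion object: names
    * present on both sides get their hashes combined, the rest are kept. */
  def merge(classHashes: Seq[NameHash], objectHashes: Seq[NameHash]): Seq[NameHash] = {
    val fromObject = objectHashes.map(nh => nh.name -> nh.hash).toMap
    val combined = classHashes.map { nh =>
      fromObject.get(nh.name) match {
        case Some(h) => NameHash(nh.name, nh.hash ^ h) // XOR as a stand-in combinator
        case None    => nh
      }
    }
    combined ++ objectHashes.filterNot(nh => classHashes.exists(_.name == nh.name))
  }
}
```

The key property is that the merge works purely on precomputed hashes, which is exactly why it can run after API minimization has discarded member names and signatures.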
However, we had to make two exceptions to invalidation performed at the class level:

1. Sources that have just been added: we don't know their declared classes yet, so we have to schedule them for compilation as-is.
2. Sources that do not declare any classes have to be scheduled for recompilation directly too.

This is the reason why `cycle` takes both invalidated classes and modified srcs as inputs, and why `invalidateInitial` computes both. After the first iteration of `cycle`, the set of modified sources becomes empty and the remainder of the invalidation is performed at the class level only.

Here's a list of changes I think are worth highlighting, either for clarity or to make a point:

- SameAPI dropped some old, unused code from the TopLevel and NameChanges classes
- APIs do not have any reference to `java.io.File` now; this data structure operates purely on class names
- helper methods for looking up dependency information from Relations have been changed to work at the class level
- the version number in TextAnalysisFormat has been bumped
- the `inherited_type_params` scripted test has been removed, as it doesn't look like it tests anything useful and it breaks due to changes to the `APIs` interface
- Analyze doesn't store empty APIs for Java source files that do not declare any classes; we use `AnalysisCallback.startSource` for tracking
- the Scala 2.8-specific test has been dropped
- the test for Ant-style compilation is marked as pending; supporting Ant-style compilation is tricky because invalidation happens at the class level now
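The shape of that loop, with its two file-level exceptions, can be sketched as follows. All names here (`InvalidationSketch`, `sourcesToRecompile`, `declaredIn`) are illustrative, not zinc's actual API:

```scala
// Sketch: invalidation is class-based, but freshly added sources and
// class-less sources enter the cycle as files directly.
object InvalidationSketch {
  type ClassName  = String
  type SourceFile = String

  /** Map invalidated class names back to the files that declare them and
    * union the result with the directly modified sources (the two
    * exceptions: added files and files declaring no classes). */
  def sourcesToRecompile(
      invalidatedClasses: Set[ClassName],
      modifiedSrcs: Set[SourceFile],
      declaredIn: ClassName => Option[SourceFile]): Set[SourceFile] =
    invalidatedClasses.flatMap(c => declaredIn(c)) ++ modifiedSrcs
}
```

After the first iteration the `modifiedSrcs` argument becomes empty, so later iterations are driven purely by class names.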
Ha ha, easy. Maybe this ticket will eventually make sense to someone other than me.
This looks awesome. It is a major improvement to Scala development and I'm excited to see it coming in sbt 1.0. To the best of my knowledge, @gkossakowski is not working on this anymore. Is anyone else taking care of finishing this ticket off? Is this blocked by something?
Being tracked as https://github.com/sbt/zinc/milestone/1.
The `binaryClassName` relation maintains the mapping between source and binary class names. This mapping is needed to map binary dependencies back to source dependencies in the case of separate compilation (where we see dependencies on class files). You can see that mapping being used in the `binaryDependency` method implementation of AnalysisCallback. Previously, we would map a class file to the source file it was produced from and then assume that the dependency is on any (all) of the classes declared in that source file. The introduction of `binaryClassName` lets us map the dependency back to a source class name directly and removes that imprecision of dependency tracking.

We maintain the mapping between source and binary class names just for non-local classes. Check this sbt/sbt#1104 (comment) for the discussion of local and non-local classes.

We also rework the tracking of products in Analysis by explicitly introducing the concept of local and non-local products corresponding to local and non-local classes. This helps clarify for which classes we track source and binary class names. Rewritten from sbt/zinc@f5b0b60
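A minimal sketch of that lookup direction: given a binary class name seen in a class-file dependency, recover the source class name to record the dependency against. A plain `Map` stands in for zinc's `Relation` here, and the example names are made up:

```scala
// Sketch: binary class names of nested classes use '$', source class
// names use '.'; the relation lets us translate between the two.
object BinaryClassNameSketch {
  // source class name -> binary class name (non-local classes only)
  val binaryClassName: Map[String, String] =
    Map("x.y.A" -> "x.y.A", "x.y.A.B" -> "x.y.A$B")

  private val inverse: Map[String, String] = binaryClassName.map(_.swap)

  /** Given a binary class name, recover the source class name the
    * dependency should be recorded against. */
  def sourceClassOf(binaryName: String): Option[String] = inverse.get(binaryName)
}
```

With this in place, `binaryDependency` can record `x.y.A$B` as a dependency on the source class `x.y.A.B` directly instead of on every class in the file that produced the class file.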
This is a meta-issue that provides a quick overview of what needs to be done in order to change dependency tracking from source-file-based to class-name-based.
Tasks
Class-based dependency tracking can be implemented in two phases. The third phase is dedicated to testing on large projects.
Invalidate by inheritance based on class dependencies (phase 1)
Tasks marked as completed are completed only in a prototype implementation. No changes have been merged to sbt yet.

- `declaredClasses` (to immediately map them back to files until the rest of the algorithm is implemented)
- `API` object corresponding to a sealed parent; this will enable us to perform proper invalidation when a new case (class) is introduced to a sealed hierarchy
- `binaryClassName: Relation[String, String]` and use it whenever the conversion is needed.

Track name hashes per class (phase 2)

- use `declaredClasses` in almost the entire algorithm (would fix "Adding a method to a class forces dependent (by inheritance) files to recompile" #2320)

Only once the last bullet is merged will we see an improvement to incremental compilation when a big number of nested classes is involved (e.g. as in the Scala compiler's case).

Testing, bug fixing and benchmarking (phase 3)

- `;clean;compile` performance

Merge changes upstream, prepare for shipping (phase 4)
Targeted sbt version
This most likely is going to be shipped with sbt 1.0-M1.
Benefits
The improved dependency tracking delivers up to 40x speedups of incremental compilation in the scenarios tested. Check the benchmarking results here: #1104 (comment)
The speedups are caused by fixing two main issues. The first is likely to be triggered by any code base that defines more than one class in a single source file. The second affects code bases with a big number of nested classes that are inherited.
One example is the Scala compiler itself. Even with name hashing, we invalidate too much, and the code edit cycle becomes long whenever a new member is introduced.
The work described in this issue is funded by Lightbend. I'm working on it as a contractor.