Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Abstract away optional AST traversals (first step) #1426

Merged
merged 4 commits into from
Nov 26, 2018

Conversation

oowekyala
Copy link
Member

@oowekyala oowekyala commented Nov 2, 2018

First PR for 7.0.0! 😄 🤘

Basically this is a first step to simplify the way we handle optional AST traversals like typeresolution or symbol table. It's necessary to detect XPath dependencies automatically, and globally will make our life easier.

Why we need a change

Type resolution, DFA, etc. all use a visitor that runs on the AST, wrapped in not-so-useful "façades" and provided by a LanguageVersionHandler. At the moment:

  • Every language must have all of them, even if it's just a dummy. This is bad because these stages are ultimately language-specific ---e.g. not all languages will implement their type res with a preliminary qname resolution like Java
    • Each language should be able to define themselves the visits they want to schedule independently of how other languages do it
  • They are all executed in the same order for all language (logic is in SourceCodeProcessor), even though the order is an implementation detail which should rest... with the language implementation.
  • Adding a new processing stage which only some rules activate is useful, yet requires a change to the LanguageVersionHandler interface, to its abstract class, to the SourceCodeProcessor, to the Rule interface, etc. The change must cascade from Rule to RuleSet to RuleSets. It also requires a change to the ruleset schema. Hmm spaghetti 🍝
    • These optional AST visits should be handled uniformly by pmd-core, and adding or removing them only entail changes to the language implementation, not pmd-core
  • Current timing logic and check for dependency is duplicated for dfa, typeres, multifile, etc.

AST processing stages all have in common that:

  • They take an AST + other config (e.g. a classloader) and perform side effects on the AST (e.g. to populate qnames).
  • They all run after the parsing stage and before rule application

This forms a basis for abstracting them.

What's in this PR

  • Processing stages are reified under the interface AstProcessingStage.
  • LanguageVersionHandlers know the full set of AstProcessingStage a language version supports.
  • AstProcessingStages may declare dependencies on other stages. E.g. Java typeres depends on qname resolution.
  • AstProcessingStages are implemented by an enum for each language that supports some. This is quite natural for ordering reasons. Adding or removing AST visits is easy, so is specifying dependencies.
  • The Rule interface has a new method dependsOn(AstProcessingStage), that returns whether the rule depends on the stage or not.

What's not in this PR

  • The actual way a rule declares a dependency on an stage is still unclear. Probably there will be a method on the Rule interface that returns the AstProcessingStages the rule depends on. This can then be implemented by abstract rule classes.
  • The processing stages are not wired into SourceCodeProcessor for now.
    • TimedOperationCategory needs to be refactored to a class to be extensible. This is probably the next PR.
    • The fact that RuleSets is mutable poses a problem to avoiding unnecessary computation. Either we make it immutable, or we have a new immutable class that takes care of precomputing all that can be done in advance. I'd really like them to make them immutable, because it will be much simpler to build them and reason about them, but this will change the API significantly. Wdyt?
    • This is further explained in a comment in SourceCodeProcessor.
  • Since it's not wired I haven't removed the former API, just deprecated it. Deprecations will need to be ported to the master branch.

@oowekyala oowekyala added this to the 7.0.0 milestone Nov 2, 2018
@oowekyala oowekyala force-pushed the extensible-ast-processing-stages branch from da64520 to 847f294 Compare November 2, 2018 07:40
@pmd-test
Copy link

pmd-test commented Nov 2, 2018

1 Warning
⚠️ Running pmdtester failed, this message is mainly used to remind the maintainers of PMD.

Generated by 🚫 Danger

@adangel
Copy link
Member

adangel commented Nov 20, 2018

Some general thoughts:

They are all executed in the same order for all language (logic is in SourceCodeProcessor), even
though the order is an implementation detail which should rest... with the language implementation.

Fully agree. We have also some unit tests, that manually call the parser and if they need typeres, they then manually need the different facades - and they need to call them in the correct order (e.g. qname before typeres). And it is definitely a language implementation detail.

AstProcessingStages may declare dependencies on other stages. E.g. Java typeres depends on
qname resolution.

Do we need this detail? If the language handler knows all the stages, it should also know the order, in which they are executed. If an AstProcessingStage depends on another stage, maybe we shouldn't have it separated at all?
I'm just trying to avoid the logic of determining the correct ordering of the AstProcessingStages based on the dependency - it would be much easier, if the LanguageHandler impl would just define the correct order.

The Rule interface has a new method dependsOn(AstProcessingStage), that returns whether the
rule depends on the stage or not.

This would allow to have maybe some performance improvements if (and only if) you are only executing rules, that don't need all stages. I would argue for removing this feature:

  • Most likely, you are executing more than one rule - so, the more rules you have in the ruleset, the higher the chances are, that you need all stages. So, the performance benefit would only be visible in unit tests, where you might execute rule by rule.
  • It's error prone to declare for each rule, which stages are needed. And as you pointed out: adding a new stage might result in changing the ruleset schema (if we want some type safety down to this and not just strings).
  • You only get the full power of analysis, if you make use of all stages - so, I'd rather enable all stages by default always. If we have performance issues, we should work on improving the processing and not removing some stages, that we might not need for a specific rule, but the stage would be enabled anyways, because a different rule in the ruleset needs it...

The changes you proposed make sense. I'll look into the code now. Thanks!

* @since 7.0.0
*/
@Experimental
default boolean dependsOn(AstProcessingStage<?> stage) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described in my general comment, I'd argue for not introducing this feature to enable/disable specific stages...

// basically:
// 1. make the union of all stage dependencies of each rule, by language, for the Rulesets
// 2. order them by dependency
// 3. run them and time them if needed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be simplified to just: "run them all and time them"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is enticing in theory, but I think we should first address the performance issues, and then enable it by default

// The approach I'd like to take is either
// * to create a new RunnableRulesets class which is immutable, and performs all these preliminary
// computations upon construction.
// * or to modify Ruleset and Rulesets to be immutable. This IMO is a better option because it makes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think, immutability is the way to go. Ruleset is already immutable, isn't it? So, we would just need make Rulesets immutable - this would be then the configured rules, that are to be executed.

Since executing rules runs in multiple threads and rules do not need to be thread-safe, we'll still need to be able to clone a complete Rulesets instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make RuleSets immutable. Deprecate mutation methods and move fordward

*
* @return VisitorStarter
*/
// TODO should we deprecate? Not much use to it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DumpFacade: It's more a tool to help during development. We could deprecated it and point to the Designer, that should be used instead to get a look at the AST....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely!

* after parsing is done. Each of these stages implicitly
* depends on the parser stage.
*
* <p>An analysis on a file goes through the following stages:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where would you see the multifile analysis? Is this a stage as well?
For multifile analysis, we probably can only run the rules, that need it, after all files have been fully parsed. So, the collection of the data could be done in a stage, that is executed individually for each file, but I think, the rules, that make use of it, need to be executed separately, after rulechain rules and other rules have been executed... Wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think multifile wouldn't fit neatly into this framework. I'd like it more to be something transparent to the rules.

I imagine we could have a separate class of visitors (not rules) that run after parsing and before rule application. These are declared by rules, and each gather some information relevant to the rule that declares them. This could be a stage, but the actual visitors being applied depends on the contents of the ruleset, so it probably needs special treatment...

Upon rule application the mutlifile rules query the contents of a global index, knowing what keys their data collection visitors registered and being able to access that info for all files of the analysis. Multifile rules wouldn't need an AST to run, so they should also be treated very specially.

I read a bit about IntelliJ's way of doing it, which is very different although very powerful. We could apply some of the things they do to our codebase. We can perhaps talk about it in the next meeting

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multifile rules wouldn't need an AST to run, so they should also be treated very specially.

Right, multifile rules would be very different to "normal" rules...

public RuleViolationFactory getRuleViolationFactory() {
throw new UnsupportedOperationException("getRuleViolationFactory() is not supported for C++");
protected String getLanguageName() {
return "C++";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably use here net.sourceforge.pmd.lang.cpp.CppLanguageModule.NAME. Somewhere I've read your question about why we have separate CPD and PMD language modules...
Well, for now, we use the java.util.ServiceLoader<Language> facility to locate the available languages on the classpath - we have a CPD language class net.sourceforge.pmd.cpd.Language and a PMD language class net.sourceforge.pmd.lang.Language. We could unify them and move into the class the information, whether it's CPD only or PMD only.
There was once also the discussion, whether CPD should be part of PMD or a completely separate tool. If we unify them, we merge the two tools even closer together...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The complete CppHandler class is unnecessary and can be removed with 7.0.0. See #1481.

@adangel adangel self-assigned this Nov 26, 2018
@adangel adangel merged commit 938efd1 into pmd:pmd/7.0.x Nov 26, 2018
@oowekyala oowekyala deleted the extensible-ast-processing-stages branch November 26, 2018 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in:pmd-internals Affects PMD's internals
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants