-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restructure links extracts to keep original links #1055
Conversation
Currently, reffy "canonicalize" urls before recording them in the links extracts. While that canonicalization is useful to regroup links that points to the same spec, it loses critical information (e.g. for links to multipage specs) and produces information that does not really reflect the reality of the spec, making it hard to produce reliable analysis. The patch instead records the original link untouched, and annotates it with the spec that can be identified after canonicalization. This means the structure of links extract will change.
@@ -294,7 +294,7 @@ async function saveSpecResults(spec, settings) { | |||
async function crawlList(speclist, crawlOptions) { | |||
// Make a shallow copy of crawl options object since we're going | |||
// to modify properties in place | |||
crawlOptions = Object.assign({}, crawlOptions); | |||
crawlOptions = Object.assign({speclist}, crawlOptions); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've never been very clean on separating inputs and outputs. Unless I missed a copy somewhere, we directly store the results of the crawl within the spec objects in speclist
, at least temporarily while the results get saved to disk. This means speclist
may also contain a few expanded crawl results (or all of them when useCrawl
is set or when the user does not specify an output folder), up to >50MB of data.
Here, we're asking Puppeteer to send the list to a separate Chrome process, so Puppeteer will need to serialize/deserialize the list each time. That works but that seems like a waste of processing time and memory.
I think we should rather make a deeper copy here and create a new list of spec objects that only contain core metadata about each spec (and that won't be touched by the crawl). That will still send >500KB of data but that should be fine.
I can live with it, but it could perhaps be argued that the extraction code should remain atomic in the sense that its results should only depend on information about the spec and on the spec's contents (extraction code is not the easiest bit to debug since it runs in a headless Chrome instance, so better keep the logic simple). An alternative would be to split the extraction of links into two parts:
- The extraction of the raw links in Chrome through
extract-links.mjs
without attempting to link them back to specs. - A post-processing module that associates extracted links with specs and runs in the Node.js process. Post-processing is a good place to run logic that needs to look at other specs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That approach makes more sense indeed; I've updated the PR to that effect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks good to me, thanks for the update!
See inline comments, regarding Strudy in particular coz' this will break it...
src/postprocessing/annotate-links.js
Outdated
const fs = require('fs'); | ||
const path = require('path'); | ||
|
||
function canonicalizeUrl(url, options) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strudy currently needs canonicalizeUrl
so deleting the mjs file and moving the logic here will break it. That was one of the remaining todos in #747 (comment). I suppose you're looking into Strudy in any case and planning to fix the code there?
src/postprocessing/annotate-links.js
Outdated
|
||
module.exports = { | ||
dependsOn: ['links'], | ||
input: 'crawl', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the post-processor is a bit clunky here. This post-processing module can run directly once the spec has crawled so input: 'spec'
would be the right choice in theory (this would also allow to use the default save
logic that the post-processor contains instead of re-defining it because the post-processor does something else for crawl-wide post-processing modules).
That said, to be able to do that, you would need a way to access the list of specs in the crawl, which could either be done by updating the signature of the run function to run: async function(spec, specs, options)
or by copying speclist
into the options as you had done previously (no need to worry about memory copies a priori).
Anyway, no big deal for now. We can revisit that later on.
Forgot to add: we'll need to update webref's crawl job to run the post-processing module in |
More complex options are only needed in strudy, which will maintain its own version for now See also #1055 (comment)
As suggested in #1055 (comment)
Following change in w3c/reffy#1055 Also split code between CLI and re-usable library
Breaking change: - Restructure links extracts to keep original links (#1055) Dependencies bumped: - Bump web-specs from 2.19.0 to 2.23.0 (#1056) - Bump puppeteer from 16.1.1 to 17.1.1 (#1057) - Bump rollup from 2.78.0 to 2.79.0 (#1054) How to upgrade: The breaking change only affects code that makes use of links extracts. These extracts are now using an object with an anchors property that lists the said anchors, instead of being listed directly as an array.
* Adapt to removal of canonicalization code in reffy per w3c/reffy#1055 (comment) * Adapt to new structure of links data Following change in w3c/reffy#1055 Also split code between CLI and re-usable library * Add comment about duplicated code with Reffy
Currently, reffy "canonicalize" urls before recording them in the links extracts.
While that canonicalization is useful to regroup links that points to the same spec, it loses critical information (e.g. for links to multipage specs) and produces information that does not really reflect the reality of the spec, making it hard to produce reliable analysis.
The patch instead records the original link untouched, and annotates it with the spec that can be identified after canonicalization.
This means the structure of links extract will change.