Extract CDDL definitions (#1723)
* Extract CDDL definitions

Needed for w3c/webref#1353.

With this update, Reffy now looks for and extracts CDDL content defined in
`<pre class="cddl">` blocks. The logic is very similar to the logic used for
IDL. Shared code was factored out accordingly.

Something specific about CDDL: on top of generating text extracts, the goal is
also to create one extract per CDDL module that the spec defines. To associate
a `<pre>` block with one or more CDDL modules, the code looks for a
`data-cddl-module` attribute, or for module names in the `class` attribute
(prefixed by `cddl-` or suffixed by `-cddl`). The former isn't used by any spec
yet but is the envisioned mechanism in Bikeshed to define the association; the
latter is the convention currently used in the WebDriver BiDi specification.
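
To illustrate the association rules, here is a minimal sketch that applies them
to raw attribute strings rather than to a DOM element. The function name
`modulesFromAnnotations` and the sample class values are assumptions made for
the example; the actual code in this commit is `getModules` in
`extract-cddl.mjs`.

```javascript
// Hypothetical helper: same rules as getModules(), but fed with the raw
// `data-cddl-module` and `class` attribute values so it runs outside a browser.
function modulesFromAnnotations(dataCddlModule, className) {
  // An explicit data-cddl-module attribute wins: comma-separated module names.
  if (dataCddlModule) {
    return dataCddlModule.split(',').map(str => str.trim());
  }

  // Otherwise look for class names of the form "cddl-xxx" or "xxx-cddl".
  const list = [];
  for (const name of (className ?? '').split(/\s+/).filter(Boolean)) {
    const match = name.match(/^(.*)-cddl$|^cddl-(.*)$/);
    if (match) {
      const shortname = match[1] ?? match[2];
      if (!list.includes(shortname)) {
        list.push(shortname);
      }
    }
  }
  return list;
}

// Illustrative inputs only
console.log(modulesFromAnnotations(null, 'cddl remote-cddl local-cddl'));
console.log(modulesFromAnnotations('local, remote', 'cddl'));
```

The plain `cddl` class does not match either pattern, so it never creates a
module on its own.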

When a spec defines modules, CDDL defined in a `<pre>` block with no explicit
module annotation is considered to be defined for all modules (not doing so
would essentially mean the CDDL would not be defined for any module, which
seems weird).
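
A rough sketch of that fallback, under the assumption that each block has
already been reduced to a list of module names plus its CDDL text (the
`groupByModule` name, the input shape and the CDDL snippets are made up for the
example):

```javascript
// Hypothetical input shape: [{ modules: ['local'], cddl: '...' }, ...]
function groupByModule(blocks) {
  // First pass: collect the names of all modules that the spec defines.
  const modules = {};
  for (const block of blocks) {
    for (const name of block.modules) {
      modules[name] = [];
    }
  }

  // Second pass: a block with no annotation goes to every known module.
  for (const block of blocks) {
    const targets = block.modules.length > 0 ?
      block.modules :
      Object.keys(modules);
    for (const name of targets) {
      modules[name].push(block.cddl);
    }
  }
  return modules;
}

const grouped = groupByModule([
  { modules: ['local'], cddl: 'Command = { id: uint }' },
  { modules: [], cddl: 'Extensible = (*text => any)' },
  { modules: ['remote'], cddl: 'Message = { type: text }' }
]);
console.log(grouped);
```

Here the unannotated `Extensible` rule ends up in both the `local` and the
`remote` lists. Two passes are needed because a module may only be declared by
a block that appears after the unannotated one.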

When there is CDDL, the extraction produces:
1. an extract that contains all CDDL definitions: `cddl/[shortname].cddl`
2. one extract per CDDL module: `cddl/[shortname]-[modulename].cddl`
(I'm going to assume that no one is ever going to define a module name that
would make `[shortname]-[modulename]` collide with the shortname of another
spec).

Note: some specs that define CDDL do not flag the `<pre>` blocks in any way
(Open Screen Protocol, WebAuthn). Extraction won't work for them for now. Also,
there are a couple of places in the WebDriver BiDi spec that use a
`<pre class="cddl">` block to *reference* a CDDL construct defined elsewhere.
Extraction will happily include these references as well, leading to CDDL
extracts that contain invalid CDDL. These need fixing in the specs.

* Change name of "all" extract, allow CDDL defs for it

When a spec defines CDDL modules, the union of all CDDL is now written to a
file named `[shortname]-all.cddl` instead of simply `[shortname].cddl`. This
is meant to make it slightly clearer that the file containing the union of all
CDDL is not necessarily a panacea. For example, it may not contain a useful
first rule against which a CBOR data item that matches any of the modules
could be validated.
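
As a sketch of the resulting naming scheme (the `cddlExtractFiles` helper and
the `example-spec` shortname are hypothetical; the `local` and `remote` module
names are only illustrative):

```javascript
// Hypothetical helper that lists the extract files expected for a spec,
// given its shortname and the names of the CDDL modules it defines.
function cddlExtractFiles(shortname, moduleNames) {
  // With modules: one "all" file plus one file per module.
  // Without modules: a single file named after the spec.
  const names = moduleNames.length > 0 ? ['all', ...moduleNames] : [''];
  return names.map(name => `cddl/${shortname}${name ? `-${name}` : ''}.cddl`);
}

console.log(cddlExtractFiles('webdriver-bidi', ['local', 'remote']));
console.log(cddlExtractFiles('example-spec', []));
```

A consumer can thus tell from the presence of a plain `[shortname].cddl` file
that the spec does not define any module.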

In other words, when the crawler produces a `[shortname].cddl` file, that
means the spec defines no module. Otherwise, it is best to look at the
per-module files, with "all" being a reserved module name in the spec that
gets interpreted to mean "any module".

When a spec defines CDDL modules, it may also define CDDL rules that only
appear in the "all" file by specifying `data-cddl-module="all"`. This makes it
possible to define a useful first rule in the "all" extract.
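
A minimal sketch of how the reserved "all" name behaves, assuming blocks have
already been reduced to a list of module names plus CDDL text (the
`buildExtracts` name, the input shape and the CDDL rules are invented for the
example):

```javascript
// Hypothetical input shape: [{ modules: ['all'], cddl: '...' }, ...]
function buildExtracts(blocks) {
  // "all" is reserved: it never creates a module of its own.
  const modules = {};
  for (const block of blocks) {
    for (const name of block.modules) {
      if (name !== 'all') {
        modules[name] = [];
      }
    }
  }

  const all = [];
  for (const block of blocks) {
    // Every block lands in the "all" extract.
    all.push(block.cddl);
    const targets = block.modules.length > 0 ?
      block.modules :
      Object.keys(modules);
    for (const name of targets) {
      // A block flagged "all" stays out of the per-module extracts.
      if (name !== 'all') {
        modules[name].push(block.cddl);
      }
    }
  }
  return { all, modules };
}

const extracts = buildExtracts([
  { modules: ['all'], cddl: 'Message = LocalMessage / RemoteMessage' },
  { modules: ['local'], cddl: 'LocalMessage = { id: uint }' },
  { modules: ['remote'], cddl: 'RemoteMessage = { type: text }' }
]);
console.log(extracts);
```

In this example the `Message` rule only shows up in `extracts.all`, where it
can act as the first rule of the merged file.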

* Integrate review feedback
tidoust authored Dec 13, 2024
1 parent ed90a50 commit fc8c944
Showing 9 changed files with 433 additions and 51 deletions.
125 changes: 125 additions & 0 deletions src/browserlib/extract-cddl.mjs
@@ -0,0 +1,125 @@
import getCodeElements from './get-code-elements.mjs';
import trimSpaces from './trim-spaces.mjs';

/**
* Extract the list of CDDL definitions in the current spec.
*
* A spec may define more than one CDDL module. For example, the WebDriver BiDi
* spec has CDDL definitions that apply to either or both the local end and the
* remote end. The function returns an array that lists all CDDL modules.
*
* Each CDDL module is represented as an object with the following keys whose
* values are strings:
* - name: the CDDL module shortname. The name is "" if the spec does not
* define any module, and "all" for the dump of all CDDL definitions.
* - label: A full name for the CDDL module, when defined.
* - cddl: A dump of the CDDL definitions.
*
* If the spec defines more than one module, the first item in the array is the
* "all" module that contains a dump of all CDDL definitions, regardless of the
* module they are actually defined for (the assumption is that looking at the
* union of all CDDL modules defined in a spec will always make sense, and that
* a spec will never reuse the same rule name with a different definition for
* different CDDL modules).
*
* @function
* @public
* @return {Array} A dump of the CDDL definitions per CDDL module, or an empty
* array if the spec does not contain any CDDL.
*/
export default function () {
// Specs with CDDL are either recent enough that they all use the same
// `<pre class="cddl">` convention, or they don't flag CDDL blocks in any
// way, making it impossible to extract them.
const cddlSelectors = ['pre.cddl:not(.exclude):not(.extract)'];
const excludeSelectors = ['#cddl-index'];

// Retrieve all elements that contain CDDL content
const cddlEls = getCodeElements(cddlSelectors, { excludeSelectors });

// Start by assembling the list of modules
const modules = {};
for (const el of cddlEls) {
const elModules = getModules(el);
for (const name of elModules) {
// "all" does not create a module on its own, that's the name of
// the CDDL module that contains all CDDL definitions.
if (name !== 'all') {
modules[name] = [];
}
}
}

// Assemble the CDDL per module
const mergedCddl = [];
for (const el of cddlEls) {
const cddl = trimSpaces(el.textContent);
if (!cddl) {
continue;
}
// All CDDL appears in the "all" module.
mergedCddl.push(cddl);
let elModules = getModules(el);
if (elModules.length === 0) {
// No module means the CDDL is defined for all modules
elModules = Object.keys(modules);
}
for (const name of elModules) {
// CDDL defined for the "all" module is only defined for it
if (name !== 'all') {
if (!modules[name]) {
modules[name] = [];
}
modules[name].push(cddl);
}
}
}

if (mergedCddl.length === 0) {
return [];
}

const res = [{
name: Object.keys(modules).length > 0 ? 'all' : '',
cddl: mergedCddl.join('\n\n')
}];
for (const [name, cddl] of Object.entries(modules)) {
res.push({ name, cddl: cddl.join('\n\n') });
}
// Remove trailing spaces and use spaces throughout
for (const cddlModule of res) {
cddlModule.cddl = cddlModule.cddl
.replace(/\s+$/gm, '\n')
.replace(/\t/g, ' ')
.trim();
}
return res;
}


/**
* Retrieve the list of CDDL module shortnames that the element references.
*
* This list of modules is either specified in a `data-cddl-module` attribute
* or directly within the class attribute prefixed by `cddl-` or suffixed by
* `-cddl`.
*/
function getModules(el) {
const moduleAttr = el.getAttribute('data-cddl-module');
if (moduleAttr) {
return moduleAttr.split(',').map(str => str.trim());
}

const list = [];
const classes = el.classList.values();
for (const name of classes) {
const match = name.match(/^(.*)-cddl$|^cddl-(.*)$/);
if (match) {
const shortname = match[1] ?? match[2];
if (!list.includes(shortname)) {
list.push(shortname);
}
}
}
return list;
}
65 changes: 15 additions & 50 deletions src/browserlib/extract-webidl.mjs
@@ -1,14 +1,14 @@
import getGenerator from './get-generator.mjs';
import informativeSelector from './informative-selector.mjs';
import cloneAndClean from './clone-and-clean.mjs';
import getCodeElements from './get-code-elements.mjs';
import trimSpaces from './trim-spaces.mjs';

/**
* Extract the list of WebIDL definitions in the current spec
*
* @function
* @public
* @return {Promise} The promise to get a dump of the IDL definitions, or
* an empty string if the spec does not contain any IDL.
* @return {String} A dump of the IDL definitions, or an empty string if the
* spec does not contain any IDL.
*/
export default function () {
const generator = getGenerator();
@@ -70,56 +70,21 @@ function extractBikeshedIdl() {
* sure that it only extracts elements once.
*/
function extractRespecIdl() {
// Helper function that trims individual lines in an IDL block,
// removing as much space as possible from the beginning of the page
// while preserving indentation. Rules followed:
// - Always trim the first line
// - Remove whitespaces from the end of each line
// - Replace lines that contain spaces with empty lines
// - Drop same number of leading whitespaces from all other lines
const trimIdlSpaces = idl => {
const lines = idl.trim().split('\n');
const toRemove = lines
.slice(1)
.filter(line => line.search(/\S/) > -1)
.reduce(
(min, line) => Math.min(min, line.search(/\S/)),
Number.MAX_VALUE);
return lines
.map(line => {
let firstRealChat = line.search(/\S/);
if (firstRealChat === -1) {
return '';
}
else if (firstRealChat === 0) {
return line.replace(/\s+$/, '');
}
else {
return line.substring(toRemove).replace(/\s+$/, '');
}
})
.join('\n');
};

// Detect the IDL index appendix if there's one (to exclude it)
const idlEl = document.querySelector('#idl-index pre') ||
document.querySelector('.chapter-idl pre'); // SVG 2 draft

let idl = [
const idlSelectors = [
'pre.idl:not(.exclude):not(.extract):not(#actual-idl-index)',
'pre:not(.exclude):not(.extract) > code.idl-code:not(.exclude):not(.extract)',
'pre:not(.exclude):not(.extract) > code.idl:not(.exclude):not(.extract)',
'div.idl-code:not(.exclude):not(.extract) > pre:not(.exclude):not(.extract)',
'pre.widl:not(.exclude):not(.extract)'
]
.map(sel => [...document.querySelectorAll(sel)])
.reduce((res, elements) => res.concat(elements), [])
.filter(el => el !== idlEl)
.filter((el, idx, self) => self.indexOf(el) === idx)
.filter(el => !el.closest(informativeSelector))
.map(cloneAndClean)
.map(el => trimIdlSpaces(el.textContent))
.join('\n\n');
];

return idl;
const excludeSelectors = [
'#idl-index',
'.chapter-idl'
];

const idlElements = getCodeElements(idlSelectors, { excludeSelectors });
return idlElements
.map(el => trimSpaces(el.textContent))
.join('\n\n');
}
21 changes: 21 additions & 0 deletions src/browserlib/get-code-elements.mjs
@@ -0,0 +1,21 @@
import informativeSelector from './informative-selector.mjs';
import cloneAndClean from './clone-and-clean.mjs';

/**
* Helper function that returns a set of code elements in document order based
* on a given set of selectors, excluding elements that are within an index.
*
* The function excludes elements defined in informative sections.
*
* The code elements are cloned and cleaned before they are returned to strip
* annotations and other asides.
*/
export default function getCodeElements(codeSelectors, { excludeSelectors = [] } = {}) {
return [...document.querySelectorAll(codeSelectors.join(', '))]
// Skip excluded elements and those in informative content. An empty
// exclusion list is skipped outright: `closest('')` throws a SyntaxError.
.filter(el => excludeSelectors.length === 0 || !el.closest(excludeSelectors.join(', ')))
.filter(el => !el.closest(informativeSelector))

// Clone and clean the elements
.map(cloneAndClean);
}
4 changes: 4 additions & 0 deletions src/browserlib/reffy.json
@@ -62,5 +62,9 @@
"href": "./extract-ids.mjs",
"property": "ids",
"needsIdToHeadingMap": true
},
{
"href": "./extract-cddl.mjs",
"property": "cddl"
}
]
36 changes: 36 additions & 0 deletions src/browserlib/trim-spaces.mjs
@@ -0,0 +1,36 @@
/**
* Helper function that trims individual lines in a code block, removing as
* much space as possible from the beginning of each line while preserving
* relative indentation.
*
* Typically useful for CDDL and IDL extracts.
*
* Rules followed:
* - Always trim the first line
* - Remove whitespace from the end of each line
* - Replace lines that only contain whitespace with empty lines
* - Drop the same number of leading whitespace characters from all other lines
*/
export default function trimSpaces(code) {
const lines = code.trim().split('\n');
const toRemove = lines
.slice(1)
.filter(line => line.search(/\S/) > -1)
.reduce(
(min, line) => Math.min(min, line.search(/\S/)),
Number.MAX_VALUE);
return lines
.map(line => {
let firstRealChar = line.search(/\S/);
if (firstRealChar === -1) {
return '';
}
else if (firstRealChar === 0) {
return line.replace(/\s+$/, '');
}
else {
return line.substring(toRemove).replace(/\s+$/, '');
}
})
.join('\n');
}
30 changes: 29 additions & 1 deletion src/lib/specs-crawler.js
@@ -251,6 +251,29 @@ async function saveSpecResults(spec, settings) {
return `css/${spec.shortname}.json`;
};

async function saveCddl(spec) {
let cddlHeader = `
; GENERATED CONTENT - DO NOT EDIT
; Content was automatically extracted by Reffy into webref
; (https://github.com/w3c/webref)
; Source: ${spec.title} (${spec.crawled})`;
cddlHeader = cddlHeader.replace(/^\s+/gm, '').trim() + '\n\n';
const res = [];
for (const cddlModule of spec.cddl) {
const cddl = cddlHeader + cddlModule.cddl + '\n';
const filename = spec.shortname +
(cddlModule.name ? `-${cddlModule.name}` : '') +
'.cddl';
await fs.promises.writeFile(
path.join(folders.cddl, filename), cddl);
res.push({
name: cddlModule.name,
file: `cddl/${filename}`
});
}
return res;
};

// Save IDL dumps
if (spec.idl) {
spec.idl = await saveIdl(spec);
@@ -283,9 +306,14 @@ async function saveSpecResults(spec, settings) {
(typeof thing == 'object') && (Object.keys(thing).length === 0);
}

// Save CDDL extracts (text files, multiple modules possible)
if (!isEmpty(spec.cddl)) {
spec.cddl = await saveCddl(spec);
}

// Save all other extracts from crawling modules
const remainingModules = modules.filter(mod =>
!mod.metadata && mod.property !== 'css' && mod.property !== 'idl');
!mod.metadata && !['cddl', 'css', 'idl'].includes(mod.property));
for (const mod of remainingModules) {
await saveExtract(spec, mod.property, spec => !isEmpty(spec[mod.property]));
}
30 changes: 30 additions & 0 deletions src/lib/util.js
@@ -796,6 +796,36 @@ async function expandSpecResult(spec, baseFolder, properties) {
return;
}

// Treat CDDL extracts separately, one spec may have multiple CDDL
// extracts (actual treatment is similar to IDL extracts otherwise)
if (property === 'cddl') {
if (!spec[property]) {
return;
}
for (const cddlModule of spec[property]) {
if (!cddlModule.file) {
continue;
}
if (baseFolder.startsWith('https:')) {
const url = (new URL(cddlModule.file, baseFolder)).toString();
const response = await fetch(url, { nolog: true });
contents = await response.text();
}
else {
const filename = path.join(baseFolder, cddlModule.file);
contents = await fs.readFile(filename, 'utf8');
}
if (contents.startsWith('; GENERATED CONTENT - DO NOT EDIT')) {
// Normalize newlines to avoid off-by-one slices when we remove
// the trailing newline that was added by saveCddl
contents = contents.replace(/\r/g, '');
const endOfHeader = contents.indexOf('\n\n');
contents = contents.substring(endOfHeader + 2).slice(0, -1);
}
cddlModule.cddl = contents;
}
}

// Only consider properties that link to an extract, i.e. an IDL
// or JSON file in subfolder.
if (!spec[property] ||
3 changes: 3 additions & 0 deletions tests/crawl-test.json
@@ -24,6 +24,7 @@
},
"title": "WOFF2",
"algorithms": [],
"cddl": [],
"css": {
"atrules": [],
"properties": [],
@@ -99,6 +100,7 @@
"title": "No Title",
"generator": "respec",
"algorithms": [],
"cddl": [],
"css": {
"atrules": [],
"properties": [],
@@ -224,6 +226,7 @@
},
"title": "[No title found for https://w3c.github.io/accelerometer/]",
"algorithms": [],
"cddl": [],
"css": {
"atrules": [],
"properties": [],