What should mf2 textContent parsing result in? User expectation vs. DOM specification. #15
Comments
Node, via https://glennjones.net/tools/microformats/ (the "Experimental ‐ Text white-space collapsing" option was not checked, though even when it is checked the result does not seem to change for this example.)
|
The actual result I would expect is what is rendered by a browser:
Here it is in Lynx |
It only took 2 months, but I have written a draft specification for handling The algorithm combines:
I have also implemented it in JavaScript so it can be live tested and hereby announce I am willing to implement it into php-mf2 ASAP. Please have a look at the text content from HTML page for the live test and the algorithm. I will probably be moving it to the microformats wiki Soon™. It can then be linked to from other specs. E.g. #20 could be fixed simply by having the vcp spec point at the text content algorithm for its “innertext”. |
While implementing this in PHP, I ran into a little snag where a line break was being preserved at the start of the resulting string. So I have updated the algorithm to strip “leading and trailing ASCII whitespace from output” instead of removing “any leading and trailing I am now throwing more tests at it to see if I should just use ASCII whitespace more often than limiting actions to just spaces etc. |
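A minimal sketch of the difference this makes, assuming "ASCII whitespace" means the five characters the WHATWG Infra Standard lists (TAB, LF, FF, CR, SPACE). The helper name `strip_ascii_whitespace` is made up for illustration and is not taken from the draft algorithm or from php-mf2.

```python
# ASCII whitespace per the WHATWG Infra Standard: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "


def strip_ascii_whitespace(text: str) -> str:
    """Remove leading and trailing ASCII whitespace only (hypothetical helper)."""
    return text.strip(ASCII_WHITESPACE)


# A leading line break is removed, the inner one is kept.
print(repr(strip_ascii_whitespace("\nHello\nWorld \t")))

# A no-break space (U+00A0) is not ASCII whitespace, so it survives,
# whereas Python's bare str.strip() would remove it.
print(repr(strip_ascii_whitespace("\u00a0Hello\u00a0")))
```

Stripping only spaces would leave the leading line break in place, which is the snag described above; stripping all Unicode whitespace would go further than the quoted wording asks for.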
Thought: @sknebel just wondered if this should be tweaked so whitespace is not collapsed within PRE elements. I wonder what the user expectation is there. |
How does one deal with
<article class="h-entry">
<div class="e-content">
<p>Hello<br>
World</p>
<pre>
this is some pre formatted text
this is more pre formatting
</pre>
</div>
</article>
what should be the |
According to my browser’s innerText method the plain text of
Or the following after
|
Adding a link to issue #83 on indieweb/microformats-ruby filed by @aaronpk back in March of this year, which is related to whitespace parsing. |
@Zegnat this is great! what's the status? time to move this to the microformats wiki? |
In #microformats today there was some discussion about @snarfed's example as parsed by latest php-mf2:
<div class="h-entry">
<div class="e-content p-name">
Hello World
<pre>
one
two
three
</pre>
</div>
</div>
"items": [
{
"type": [
"h-entry"
],
"properties": {
"name": [
"Hello World one two three"
],
"content": [
{
"html": "Hello World\n <pre>\n one\n two\n three\n <\/pre>",
"value": "Hello World one two three"
}
]
}
}
]
I don't have a strong opinion about what's "right" here, but at a glance that p-name and e-content.value without newlines and tabs looks nice from a (hypothetical) consumer perspective. |
hrm. overriding pre in name might be ok, but almost certainly not in content.value. whitespace inside pre is meaningful the same way tags like br are meaningful, and need to be preserved, as lots of people have argued both here and in applications like bridgy. |
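To make the two positions concrete, here is a rough sketch of a text extractor that collapses whitespace everywhere except inside pre. It is only an illustration built on Python's standard `html.parser`; the class `PreAwareText` and the function `text_value` are hypothetical names, and this is neither what php-mf2 does nor the draft algorithm discussed in this thread.

```python
import re
from html.parser import HTMLParser

ASCII_WS = "\t\n\x0c\r "


class PreAwareText(HTMLParser):
    """Collects text, collapsing whitespace runs except inside <pre>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.pre_depth = 0
        self.after_pre_open = False

    def handle_starttag(self, tag, attrs):
        self.after_pre_open = tag == "pre"
        if tag == "pre":
            if self.parts and not self.pre_depth:
                # Drop the collapsed space left just before the block.
                self.parts[-1] = self.parts[-1].rstrip(" ")
            self.pre_depth += 1
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth:
            self.pre_depth -= 1
            self.parts.append("\n")

    def handle_data(self, data):
        if self.pre_depth:
            # HTML parsers ignore one newline directly after <pre>.
            if self.after_pre_open and data.startswith("\n"):
                data = data[1:]
            self.after_pre_open = False
            self.parts.append(data)  # preserved verbatim
            return
        data = re.sub(r"[\t\n\x0c\r ]+", " ", data)
        if self.parts and self.parts[-1].endswith("\n"):
            data = data.lstrip(" ")  # no leading space on a new line
        self.parts.append(data)


def text_value(html: str) -> str:
    parser = PreAwareText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip(ASCII_WS)


sample = ('<div class="e-content p-name">\nHello World\n'
          '<pre>\none\ntwo\nthree\n</pre>\n</div>')
print(repr(text_value(sample)))
```

With this approach the lines inside pre stay on separate lines while the surrounding text is still collapsed, which is the behaviour argued for in content.value; whether p-name should get the same treatment is left open here.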
Whitespace in Copying comment I made on a PR regarding I believe there should be a newline before and after |
Yes, I want to move it to the wiki ASAP, and update it with some of @kartikprabhu’s work. So we can do further iteration of the algorithm there, and have its history preserved. I had planned to have it done already, but currently on holiday and internet connectivity has been spotty. I should be back this coming weekend and will be catching up on all things microformats next week! |
Per recent discussions (microformats/microformats2-parsing#15 (comment)) and my proposal there to treat <pre> similar to <p>/display: inline-block as in browsers
I'm fairly certain the text element should be calculated before output rather than stored. Here is an example using Zegnat's tool on a portion of https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute. I've imagined an HTML input, with the outer div omitted:
<h4 id="the-innertext-idl-attribute"><span class="secno">3.2.7</span> The <code id="the-innertext-idl-attribute:dom-innertext"><a href="#dom-innertext">innerText</a></code> IDL attribute<a href="#the-innertext-idl-attribute" class="self-link"></a></h4>
<div class="status"><input onclick="toggleStatus(this)" value="⋰" type="button"><p class="support"><strong>Support:</strong> innertext<span class="and_chr yes"><span>Chrome for Android</span> <span>80+</span></span><span class="chrome yes"><span>Chrome</span> <span>4+</span></span><span class="ios_saf yes"><span>iOS Safari</span> <span>4.0+</span></span><span class="firefox yes"><span>Firefox</span> <span>45+</span></span><span class="safari yes"><span>Safari</span> <span>3.2+</span></span><span class="samsung yes"><span>Samsung Internet</span> <span>4+</span></span><span class="edge yes"><span>Edge</span> <span>12+</span></span><span class="ie yes"><span>IE</span> <span>6+</span></span><span class="and_uc yes"><span>UC Browser for Android</span> <span>12.12+</span></span><span class="opera yes"><span>Opera</span> <span>9.5+</span></span><span class="op_mini yes"><span>Opera Mini</span> <span>all+</span></span><span class="android yes"><span>Android Browser</span> <span>2.3+</span></span></p><p class="caniuse">Source: <a href="https://caniuse.com/#feat=innertext">caniuse.com</a></p></div>
<aside class="mdn-anno wrapped"><button onclick="toggleStatus(this)" class="mdn-anno-btn"><b title="Support in all current engines." class="all-engines-flag">✔</b><span>MDN</span></button><div class="feature"><p><a href="https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/innerText" title="The innerText property of the HTMLElement interface represents the "rendered" text content of a node and its descendants.">HTMLElement/innerText</a></p><p class="all-engines-text">Support in all current engines.</p><div class="support"><span class="firefox yes"><span>Firefox</span><span>45+</span></span><span class="safari yes"><span>Safari</span><span>3+</span></span><span class="chrome yes"><span>Chrome</span><span>1+</span></span><hr><span class="opera yes"><span>Opera</span><span>9.6+</span></span><span class="edge_blink yes"><span>Edge</span><span>79+</span></span><hr><span class="edge yes"><span>Edge (Legacy)</span><span>12+</span></span><span class="ie yes"><span>Internet Explorer</span><span>5.5+</span></span><hr><span class="firefox_android yes"><span>Firefox Android</span><span>45+</span></span><span class="safari_ios yes"><span>Safari iOS</span><span>4+</span></span><span class="chrome_android yes"><span>Chrome Android</span><span>18+</span></span><span class="webview_android yes"><span>WebView Android</span><span>4.4+</span></span><span class="samsunginternet_android yes"><span>Samsung Internet</span><span>1.0+</span></span><span class="opera_android yes"><span>Opera Android</span><span>10.1+</span></span></div></div></aside>
<dl class="domintro"><dt><var>element</var> . <code id="dom-innertext-dev"><a href="#dom-innertext">innerText</a></code> [ = <var>value</var> ]</dt><dd>
<p>Returns the element's text content "as rendered".</p>
<p>Can be set, to replace the element's children with the given value, but with line breaks
converted to <code id="the-innertext-idl-attribute:the-br-element"><a href="text-level-semantics.html#the-br-element">br</a></code> elements.</p>
</dd></dl>
<p>On getting, the <dfn id="dom-innertext"><code>innerText</code></dfn> attribute must follow
these steps:</p>
<ol><li>
<p>If this element is not <a id="the-innertext-idl-attribute:being-rendered" href="rendering.html#being-rendered">being rendered</a>, or if the user agent is a non-CSS user
agent, then return this element's <a id="the-innertext-idl-attribute:descendant-text-content" href="https://dom.spec.whatwg.org/#concept-descendant-text-content" data-x-internal="descendant-text-content">descendant text content</a>.</p>
<p class="note">This step can produce surprising results, as when the <code id="the-innertext-idl-attribute:dom-innertext-2"><a href="#dom-innertext">innerText</a></code> attribute is accessed on an element not <a id="the-innertext-idl-attribute:being-rendered-2" href="rendering.html#being-rendered">being
rendered</a>, its text contents are returned, but when accessed on an element that is
<a id="the-innertext-idl-attribute:being-rendered-3" href="rendering.html#being-rendered">being rendered</a>, all of its children that are not <a id="the-innertext-idl-attribute:being-rendered-4" href="rendering.html#being-rendered">being rendered</a> have
their text contents ignored.</p>
</li><li><p>Let <var>results</var> be a new empty <a id="the-innertext-idl-attribute:list" href="https://infra.spec.whatwg.org/#list" data-x-internal="list">list</a>.</p></li><li>
<p>For each child node <var>node</var> of this element:</p>
<ol><li><p>Let <var>current</var> be the <a id="the-innertext-idl-attribute:list-2" href="https://infra.spec.whatwg.org/#list" data-x-internal="list">list</a> resulting in running the <a href="#inner-text-collection-steps" id="the-innertext-idl-attribute:inner-text-collection-steps">inner
text collection steps</a> with <var>node</var>. Each item in <var>results</var> will either
be a <a id="the-innertext-idl-attribute:string" href="https://infra.spec.whatwg.org/#string" data-x-internal="string">string</a> or a positive integer (a <i>required line break count</i>).</p>
<p class="note">Intuitively, a <i>required line break count</i> item means that a certain
number of line breaks appear at that point, but they can be collapsed with the line breaks
induced by adjacent <i>required line break count</i> items, reminiscent to CSS
margin-collapsing.</p>
</li><li><p>For each item <var>item</var> in <var>current</var>, append <var>item</var> to
<var>results</var>.</p></li></ol>
</li><li><p><a href="https://infra.spec.whatwg.org/#list-remove" id="the-innertext-idl-attribute:list-remove" data-x-internal="list-remove">Remove</a> any items from <var>results</var> that are the
empty string.</p></li><li><p><a href="https://infra.spec.whatwg.org/#list-remove" id="the-innertext-idl-attribute:list-remove-2" data-x-internal="list-remove">Remove</a> any runs of consecutive <i>required line break
count</i> items at the start or end of <var>results</var>.</p></li><li><p><a href="https://infra.spec.whatwg.org/#list-remove" id="the-innertext-idl-attribute:list-replace" data-x-internal="list-replace">Replace</a> each remaining run of consecutive <i>required
line break count</i> items with a string consisting of as many U+000A LINE FEED (LF) characters
as the maximum of the values in the <i>required line break count</i> items.</p></li><li><p>Return the concatenation of the string items in <var>results</var>.</p></li></ol>
RAW JSON encoded text output
Output of JSON rendered into HTML pre-formatted text
My point being that if e-content or p-content were rendered like this, it would be a fairly poor output. Text could be made an optional field, or annotated with some points.
Further: By suggesting that generation of text (non-source content) be proprietary and merely representative, the spec might more clearly communicate the intent. I might for example decide that a markdown format is suitable for an audience; or CURSES for text-mode CLI or teletype-compatible targets. I might for example manually insert ordered list indexes as content with spacing for mobile or plain-text email, and use an asterisk or similar 8-bit compatible character for unordered lists. Where changing the font is impossible, it may be possible to add newlines, underscores or textual decoration to differentiate content. |
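For readers skimming the spec excerpt quoted in the comment above, the last four innerText getter steps (drop empty strings, trim line break counts at both ends, replace each remaining run with its maximum, concatenate) can be sketched roughly as follows. The function name `join_results` is invented for illustration, and the inner text collection steps that produce the list are not covered.

```python
from typing import List, Union

# An item is either a string or a "required line break count" (positive int).
Item = Union[str, int]


def join_results(results: List[Item]) -> str:
    # Remove any items that are the empty string.
    items = [item for item in results if item != ""]

    # Remove runs of required line break counts at the start and end.
    while items and isinstance(items[0], int):
        items.pop(0)
    while items and isinstance(items[-1], int):
        items.pop()

    # Replace each remaining run of counts with as many LF characters
    # as the maximum count in that run, then concatenate the strings.
    out = []
    run_max = 0
    for item in items:
        if isinstance(item, int):
            run_max = max(run_max, item)
            continue
        if run_max:
            out.append("\n" * run_max)
            run_max = 0
        out.append(item)
    return "".join(out)


# Adjacent counts 1 and 2 collapse to two line feeds, like margin collapsing.
print(repr(join_results([2, "Hello", 1, 2, "", "World", 1])))
```

This is only the bookkeeping part of the getter; the rendered-ness checks and CSS-dependent collection steps are where a parser without a layout engine would have to make its own choices.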
Summary
At several points the parsing specification says to return the textContent, but it never defines what this means. I personally always assumed the DOM textContent property for the current element, but this does not seem to match with what parsers have been doing.
Discussion
@aaronpk wrote a blogpost today containing the following, emphasis mine:
I replied to the emphasised statement in chat:
This started a discussion in the #indieweb-dev chat that is best read in the chat logs. The discussion continued in the #microformats chat. The important take-away is that the PHP parser includes its own text extraction implementation, after an issue was filed by a user that was missing expected white space in the output.
It turned out that the JavaScript parser (glennjones/microformat-shiv) was already doing something like that.
The important part here is user expectation. The user who opened the issue on the PHP parser was expecting to see a line break in the plain text value where a <br> used to be. It is also what aaronpk would expect. From chat:
I don’t have any real personal preference. I do feel that the parsing specification should define what it wants to guarantee compatibility between parsers.
If we end up defining our own textContent algorithm for HTML→plain-text, I do think we should take a good look at what browsers are doing. Especially plain text browsers such as lynx and w3m.
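A small sketch of the gap between the two readings, assuming a made-up input of `<p>Hello<br>World</p>`. DOM textContent is the concatenation of descendant text nodes, so a br contributes nothing; the alternative shown here simply emits a line feed for each br. The names `TextCollector` and `extract` are hypothetical and only use Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Concatenates text nodes; optionally turns <br> into a line feed."""

    def __init__(self, br_as_newline: bool):
        super().__init__(convert_charrefs=True)
        self.br_as_newline = br_as_newline
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br" and self.br_as_newline:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def extract(html: str, br_as_newline: bool) -> str:
    collector = TextCollector(br_as_newline)
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


html = "<p>Hello<br>World</p>"
print(repr(extract(html, br_as_newline=False)))  # DOM textContent-like
print(repr(extract(html, br_as_newline=True)))   # what users seem to expect
```

The first call glues the two words together, which is the surprise that led to the PHP parser issue; the second keeps them on separate lines.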
Parser behaviour
Test:
Tested through microformats.io. Output shortened to only the affected h-entry. Node and Ruby were not available for testing.
PHP
Python
Go