Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MINVOKER-351] Escape special xml character in junit report #242

Merged
merged 6 commits into from
May 22, 2024

Conversation

slawekjaranowski
Copy link
Member

  • use StringEscapeUtils.escapeXml10 form commons-text

https://issues.apache.org/jira/browse/MINVOKER-351

@michael-o
Copy link
Member

I don't understand this because it looks logically wrong. The model does not care how it is marshaled, shouldn't the XML writer do this?

@slawekjaranowski
Copy link
Member Author

@michael-o there is a problem with special chars ... they are not escaped / removed by the plexus-xml

- use StringEscapeUtils.escapeXml10 form commons-text
@michael-o
Copy link
Member

@michael-o there is a problem with special chars ... they are not escaped / removed by the plexus-xml

Is there an upstream issue for this?

@michael-o
Copy link
Member

Read the JIRA issue. I assume we are talking about chars outside of https://en.wikipedia.org/wiki/Valid_characters_in_XML#XML_1.0?

@slawekjaranowski
Copy link
Member Author

@michael-o
Copy link
Member

Looking into this. Can we make sure that we do not double escape the five protected chars.

@slawekjaranowski
Copy link
Member Author

when we will use MXSerializer we will have a exception ... so special characters should be somehow filtered

https://github.com/codehaus-plexus/plexus-xml/blob/master/src/main/java/org/codehaus/plexus/util/xml/pull/MXSerializer.java#L937-L958

@slawekjaranowski
Copy link
Member Author

Looking into this. Can we make sure that we do not double escape the five protected chars.

I used:

 Xpp3DomWriter.write(new PrettyPrintXMLWriter(osw), testsuite, false);

so escaping is turned off in Xpp3DomWriter

@michael-o
Copy link
Member

@elharo Should an XML writer escape invalid chars to entities or fail here?

@michael-o
Copy link
Member

Looking into this. Can we make sure that we do not double escape the five protected chars.

I used:

 Xpp3DomWriter.write(new PrettyPrintXMLWriter(osw), testsuite, false);

so escaping is turned off in Xpp3DomWriter

Maybe you should add a comment in the source code to depict this.

@michael-o
Copy link
Member

Am I stupid or where am I supposed to find the output of new Example().printAscii();? I can't find it.

@slawekjaranowski
Copy link
Member Author

Am I stupid or where am I supposed to find the output of new Example().printAscii();? I can't find it.

There is:

public class Example {
  public void printAscii() {
    for (int i = 0; i < Byte.MAX_VALUE; ++i) {
      System.out.println((char) i);
    }
  }

Standard output of test is pass to log file and next to report

@michael-o
Copy link
Member

Am I stupid or where am I supposed to find the output of new Example().printAscii();? I can't find it.

There is:

public class Example {
  public void printAscii() {
    for (int i = 0; i < Byte.MAX_VALUE; ++i) {
      System.out.println((char) i);
    }
  }

Standard output of test is pass to log file and next to report

I know, but I don't see that stdout anywhere.

Copy link
Contributor

@elharo elharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the buildLog already XML? Looks like it might be. If so, this is the wrong place to fix this problem. The escaping needs to happen when that XML is created, and this shouldn't write XML at all, just copy bytes.

@@ -1640,7 +1642,7 @@ private void runBuild(
}
} catch (RunFailureException e) {
buildJob.setResult(e.getType());
buildJob.setFailureMessage(e.getMessage());
buildJob.setFailureMessage(StringEscapeUtils.escapeXml10(e.getMessage()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't followed through the complete code yet, but the failure message should generally only be escaped once, and that should be left to the XML writer. That is, don't escape a string that appears in PCDATA or an attribute value. Let the writer do that.

This all assumes that the writer does correctly escape everything. Some libraries have borked this up in one way or another. Looking at it now Xpp3DomWriter is a mess and doesn't have a clean model of when and what to escape. It seems to confuse the unescaped model strings with the serialized escaped strings, and switches back and forth between escaped or not depending on switches.

I'm not sure exactly how to fix this issue. Really what needs to happen is either fix XPP3DomWriter and likely XPPDom, while breaking the API and all existing dependents, or switch to a better designed XML library here. Short of that maybe you can make this work with enough unit tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildJob is class generated by modello internally MXSerializer is used

Also here exception is generated by post-build scrip and message contains output of it .... it also be everything

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably need to pop the code open to figure this out. What is "buildJob"? It's a org.apache.maven.plugins.invoker.model.BuildJob, right? If so, why does it need escaped XML test instead of just plain text? I'd expect it to supply unescaped strings to the writer which would then do any necessary escaping.

In line 1639 above, the message is not escaped. It should likely be escaped in both or neither.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other place message is produced in code and we sure that does not contains special chars.

Only exception.getMessage() can return everything

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming buildJob is just an object that contains strings and other things, but has no particular XML internals (like 99.9% of classes that contain strings) nothing inside it should be escaped.

Ideally the XML writer should escape strings when it needs to. Or is the XML writer is broken and won't do that, then the escaping should be done after the string is extracted from the buildJob and before it's passed to the writer. Escaping strings inside the buildJob breaks its semantics and tightly couples all the classes together in one particularly brittle path.

Copy link
Member

@michael-o michael-o May 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming buildJob is just an object that contains strings and other things, but has no particular XML internals (like 99.9% of classes that contain strings) nothing inside it should be escaped.

Ideally the XML writer should escape strings when it needs to. Or if the XML writer is broken and won't do that, then the escaping should be done after the string is extracted from the buildJob and before it's passed to the writer. Escaping strings inside the buildJob breaks its semantics and tightly couples all the classes together in one particularly brittle path.

Yes, the XML writer is broken and needs an issue.

@@ -1779,7 +1781,7 @@ private void writeJunitReport(BuildJob buildJob, String safeFileName) throws Moj
if (buildLogFile != null && buildLogFile.exists()) {
getLog().debug("fileLogger:" + buildLogFile);
try {
systemOut.setValue(FileUtils.fileRead(buildLogFile));
systemOut.setValue(StringEscapeUtils.escapeXml10(FileUtils.fileRead(buildLogFile)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I don't yet understand is what the "value" being set here is in Xpp3Dom. I think (could be wrong) it's supposed to be the raw text to be output, which means Xpp3Dom is in fact not a document object model at all. The confusion between object model and serialized form is a real problem with Xpp3Dom.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value is a String, that's it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildLogFile contains raw output of tested Maven project so can contain everything.

@slawekjaranowski
Copy link
Member Author

As we see root cause is in plexus-xml
I know that using escapeXml10 is a workaround here.

So what is your proposition to solve issue in m-invoker-p?

@michael-o @elharo

@michael-o
Copy link
Member

As we see root cause is in plexus-xml I know that using escapeXml10 is a workaround here.

So what is your proposition to solve issue in m-invoker-p?

@michael-o @elharo

Regardless of the fix or the flaw in Plexus XML I completely fail to understand the supplied reprocuder in the JIRA issue.

@slawekjaranowski
Copy link
Member Author

As we see root cause is in plexus-xml I know that using escapeXml10 is a workaround here.
So what is your proposition to solve issue in m-invoker-p?
@michael-o @elharo

Regardless of the fix or the flaw in Plexus XML I completely fail to understand the supplied reprocuder in the JIRA issue.

Ok.

  • we execute m-invoker-p for project src/it/MINVOKER-351
    • m-invoker-p - execute a project src/it/MINVOKER-351/src/it/minvoker-351
      • in src/it/MINVOKER-351/src/it/minvoker-351 we have a unit test which print some special chars to build output
    • m-invoker-p - collect output of src/it/MINVOKER-351/src/it/minvoker-351 and store in buildlog file
    • m-invoker-p - generate a junit report with systemOut element which contains body of buildlog file
  • execute site-plugin with surefire-report to confirm that generated junit reports are ok

@michael-o michael-o marked this pull request as draft May 17, 2024 18:17
@michael-o
Copy link
Member

Moving to draft because the IT is poorly designed.

@slawekjaranowski
Copy link
Member Author

Moving to draft because the IT is poorly designed.

I hope you want to redesign proposed IT or you have any other hints how should be done.

@michael-o
Copy link
Member

Moving to draft because the IT is poorly designed.

I hope you want to redesign proposed IT or you have any other hints how should be done.

It is rather the conditions the IT comes into play. Please read my analysis on the JIRA issue.

@michael-o
Copy link
Member

@slawekjaranowski Just ran the changed IT with 3.8.8 and 3.9.6. In 3.8.8 I don't see output from 0x00 to 0x0F at all in build.log. Regardless of the escaping you applied. There is some inherent bug here which I prefer to understand first because applying this change.

@michael-o
Copy link
Member

@slawekjaranowski Please ping compiler and surefire plugin in the embedded IT, that's the reason why there is no output.

@michael-o
Copy link
Member

Please also prepend: https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#getName%28int%29
Then we know what we print out.

@michael-o
Copy link
Member

Naive upstream solution: codehaus-plexus/plexus-xml#28

@michael-o
Copy link
Member

I have now tested the IT with the patched Plexus XML. Output looks fine now:

[INFO] Running example.minvoker351.ExampleTest
&#0; - name: NULL
&#1; - name: START OF HEADING
&#2; - name: START OF TEXT
&#3; - name: END OF TEXT
&#4; - name: END OF TRANSMISSION
&#5; - name: ENQUIRY
&#6; - name: ACKNOWLEDGE
&#7; - name: BELL
&#8; - name: BACKSPACE
	 - name: CHARACTER TABULATION

 - name: LINE FEED (LF)
&#11; - name: LINE TABULATION
&#12; - name: FORM FEED (FF)

 - name: CARRIAGE RETURN (CR)
&#14; - name: SHIFT OUT
&#15; - name: SHIFT IN
&#16; - name: DATA LINK ESCAPE
&#17; - name: DEVICE CONTROL ONE
&#18; - name: DEVICE CONTROL TWO
&#19; - name: DEVICE CONTROL THREE
&#20; - name: DEVICE CONTROL FOUR
&#21; - name: NEGATIVE ACKNOWLEDGE
&#22; - name: SYNCHRONOUS IDLE
&#23; - name: END OF TRANSMISSION BLOCK
&#24; - name: CANCEL
&#25; - name: END OF MEDIUM
&#26; - name: SUBSTITUTE
&#27; - name: ESCAPE
&#28; - name: INFORMATION SEPARATOR FOUR
&#29; - name: INFORMATION SEPARATOR THREE
&#30; - name: INFORMATION SEPARATOR TWO
&#31; - name: INFORMATION SEPARATOR ONE
  - name: SPACE
! - name: EXCLAMATION MARK
&quot; - name: QUOTATION MARK
# - name: NUMBER SIGN
$ - name: DOLLAR SIGN
% - name: PERCENT SIGN
&amp; - name: AMPERSAND
&apos; - name: APOSTROPHE
( - name: LEFT PARENTHESIS
) - name: RIGHT PARENTHESIS
* - name: ASTERISK
+ - name: PLUS SIGN
, - name: COMMA
- - name: HYPHEN-MINUS
. - name: FULL STOP
/ - name: SOLIDUS
0 - name: DIGIT ZERO
1 - name: DIGIT ONE
2 - name: DIGIT TWO
3 - name: DIGIT THREE
4 - name: DIGIT FOUR
5 - name: DIGIT FIVE
6 - name: DIGIT SIX
7 - name: DIGIT SEVEN
8 - name: DIGIT EIGHT
9 - name: DIGIT NINE
: - name: COLON
; - name: SEMICOLON
&lt; - name: LESS-THAN SIGN
= - name: EQUALS SIGN
&gt; - name: GREATER-THAN SIGN
? - name: QUESTION MARK
@ - name: COMMERCIAL AT
A - name: LATIN CAPITAL LETTER A
B - name: LATIN CAPITAL LETTER B
C - name: LATIN CAPITAL LETTER C
D - name: LATIN CAPITAL LETTER D
E - name: LATIN CAPITAL LETTER E
F - name: LATIN CAPITAL LETTER F
G - name: LATIN CAPITAL LETTER G
H - name: LATIN CAPITAL LETTER H
I - name: LATIN CAPITAL LETTER I
J - name: LATIN CAPITAL LETTER J
K - name: LATIN CAPITAL LETTER K
L - name: LATIN CAPITAL LETTER L
M - name: LATIN CAPITAL LETTER M
N - name: LATIN CAPITAL LETTER N
O - name: LATIN CAPITAL LETTER O
P - name: LATIN CAPITAL LETTER P
Q - name: LATIN CAPITAL LETTER Q
R - name: LATIN CAPITAL LETTER R
S - name: LATIN CAPITAL LETTER S
T - name: LATIN CAPITAL LETTER T
U - name: LATIN CAPITAL LETTER U
V - name: LATIN CAPITAL LETTER V
W - name: LATIN CAPITAL LETTER W
X - name: LATIN CAPITAL LETTER X
Y - name: LATIN CAPITAL LETTER Y
Z - name: LATIN CAPITAL LETTER Z
[ - name: LEFT SQUARE BRACKET
\ - name: REVERSE SOLIDUS
] - name: RIGHT SQUARE BRACKET
^ - name: CIRCUMFLEX ACCENT
_ - name: LOW LINE
` - name: GRAVE ACCENT
a - name: LATIN SMALL LETTER A
b - name: LATIN SMALL LETTER B
c - name: LATIN SMALL LETTER C
d - name: LATIN SMALL LETTER D
e - name: LATIN SMALL LETTER E
f - name: LATIN SMALL LETTER F
g - name: LATIN SMALL LETTER G
h - name: LATIN SMALL LETTER H
i - name: LATIN SMALL LETTER I
j - name: LATIN SMALL LETTER J
k - name: LATIN SMALL LETTER K
l - name: LATIN SMALL LETTER L
m - name: LATIN SMALL LETTER M
n - name: LATIN SMALL LETTER N
o - name: LATIN SMALL LETTER O
p - name: LATIN SMALL LETTER P
q - name: LATIN SMALL LETTER Q
r - name: LATIN SMALL LETTER R
s - name: LATIN SMALL LETTER S
t - name: LATIN SMALL LETTER T
u - name: LATIN SMALL LETTER U
v - name: LATIN SMALL LETTER V
w - name: LATIN SMALL LETTER W
x - name: LATIN SMALL LETTER X
y - name: LATIN SMALL LETTER Y
z - name: LATIN SMALL LETTER Z
{ - name: LEFT CURLY BRACKET
| - name: VERTICAL LINE
} - name: RIGHT CURLY BRACKET
~ - name: TILDE
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.080 s -- in example.minvoker351.ExampleTest

but the reader chokes now:

Caused by: org.xml.sax.SAXParseException: Character reference "&#
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException (ErrorHandlerWrapper.java:204)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError (ErrorHandlerWrapper.java:178)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError (XMLErrorReporter.java:399)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError (XMLErrorReporter.java:326)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError (XMLScanner.java:1466)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanCharReferenceValue (XMLScanner.java:1339)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next (XMLDocumentFragmentScannerImpl.java:3052)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next (XMLDocumentScannerImpl.java:601)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument (XMLDocumentFragmentScannerImpl.java:504)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse (XML11Configuration.java:841)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse (XML11Configuration.java:770)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse (XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse (AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse (SAXParserImpl.java:642)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse (SAXParserImpl.java:326)
    at org.apache.maven.plugins.surefire.report.TestSuiteXmlParser.parse (TestSuiteXmlParser.java:91)

See: https://stackoverflow.com/questions/55335528/fatal-error-character-reference-org-xml-sax-saxparseexception

@michael-o
Copy link
Member

And the value is indeed correct:

        Document doc2 = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new ByteArrayInputStream(w.toString().getBytes(StandardCharsets.UTF_8)));
        NodeList childNodes = doc2.getDocumentElement().getChildNodes();
        for (int i = 0; i < childNodes.getLength(); i++) {
        	System.out.println(childNodes.item(i).getTextContent());
        }

output:
[Fatal Error] :3:19: Zeichenreferenz "&#

Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 19; Zeichenreferenz "&#
	at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
	at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:338)
	at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
	at org.apache.maven.doxia.siterenderer.DefaultSiteRenderer.main(DefaultSiteRenderer.java:954)

While I don't understand that those values are serialized, but cannot be deserialized the best would be to drop the serialization of them silently in Plexus XML.

@elharo
Copy link
Contributor

elharo commented May 18, 2024

output:

Exception in thread "main" java.lang.IllegalStateException: character 0 is not allowed in output
	at org.codehaus.plexus.util.xml.pull.MXSerializer.writeElementContent(MXSerializer.java:947)
	at org.codehaus.plexus.util.xml.pull.MXSerializer.text(MXSerializer.java:780)
	at org.apache.maven.doxia.siterenderer.DefaultSiteRenderer.main(DefaultSiteRenderer.java:955)

.
Character 0, i,e. ASCII NUL, and a few others are deeply problematic as they can neither be included in XML documents nor escaped. If you're really trying to wrap arbitrary binary data into XML then you have to Base 64 encode it, though in this case we shouldn't be seeing those characters in the strings.

Otherwise you need some sort of application specific convention for how you'll encode the disallowed C0 controls. E.g. writing each one as an element like <c0 value="00"> or something like that. However you need to come up with that yourself. The XML parser/serializer won't help you there.

@michael-o
Copy link
Member

output:

Exception in thread "main" java.lang.IllegalStateException: character 0 is not allowed in output
	at org.codehaus.plexus.util.xml.pull.MXSerializer.writeElementContent(MXSerializer.java:947)
	at org.codehaus.plexus.util.xml.pull.MXSerializer.text(MXSerializer.java:780)
	at org.apache.maven.doxia.siterenderer.DefaultSiteRenderer.main(DefaultSiteRenderer.java:955)

. Character 0, i,e. ASCII NUL, and a few others are deeply problematic as they can neither be included in XML documents nor escaped. If you're really trying to wrap arbitrary binary data into XML then you have to Base 64 encode it, though in this case we shouldn't be seeing those characters in the strings.

Otherwise you need some sort of application specific convention for how you'll encode the disallowed C0 controls. E.g. writing each one as an element like <c0 value="00"> or something like that. However you need to come up with that yourself. The XML parser/serializer won't help you there.

In our case the easiest way would be to drop them, no?

@elharo
Copy link
Contributor

elharo commented May 18, 2024

I guess it depends on what it means when a NUL character or other C0 control appears here. It might be fine to drop them or replace each one with a space, especially if you don't care about round triipping.

Copy link
Member

@michael-o michael-o left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slawekjaranowski Please create an issue with Plexus XML about the incomplete XML escaping and add the link the the spots where you escape. As soon as it is fixed those spots should be removed otherwise we will double escape. Otherwise the change is now fine for me.

@elharo
Copy link
Contributor

elharo commented May 19, 2024

For working around the plexus bugs, is it possible to add an additional step in between the build job and the serializer that escapes the message rather than passing escaped values into the build job?

@slawekjaranowski
Copy link
Member Author

@elharo BuildJob serializer is generated by modello - xpp3-writer so finally MXSerializer from plexus-utils is used

@michael-o
Copy link
Member

@elharo BuildJob serializer is generated by modello - xpp3-writer so finally MXSerializer from plexus-utils is used

Right, I don't see a a way to get in between layers without monkey patching.

@slawekjaranowski
Copy link
Member Author

@michael-o, @elharo I try to fix plexus-xml
Here only add an IT which generate special chars in outputs ...

@michael-o michael-o marked this pull request as ready for review May 22, 2024 07:10
@slawekjaranowski
Copy link
Member Author

Hehe jenkins will fail because build use a previous version of invoker .... and jenkins can not parse it ...

Failed to read test report file /home/jenkins/jenkins-home/workspace/_box_maven-invoker-plugin_PR-242@2/linux-jdk17-m3.6.x_build/target/invoker-reports/TEST-MINVOKER-351.xml
org.dom4j.DocumentException: Error on line 341 of document  : An invalid XML character (Unicode: 0x7) was found in the element content of the document.
	at org.dom4j.io.SAXReader.read(SAXReader.java:511)
	at org.dom4j.io.SAXReader.read(SAXReader.java:392)
	at hudson.tasks.junit.SuiteResult.parse(SuiteResult.java:186)
	at hudson.tasks.junit.TestResult.parse(TestResult.java:390)

so will be fixed after release

@michael-o michael-o self-requested a review May 22, 2024 09:51
@michael-o
Copy link
Member

Let me try something locally:

Copy link
Member

@michael-o michael-o left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slawekjaranowski I have added some readibility improvement because the output was a bit confusing, but the test remains. I am fine with that and we need to fix Jenkins later.

@slawekjaranowski slawekjaranowski merged commit 74a87f6 into master May 22, 2024
50 checks passed
@slawekjaranowski slawekjaranowski deleted the MINVOKER-351 branch May 22, 2024 21:31
@slawekjaranowski slawekjaranowski added the bug Something isn't working label May 22, 2024
@michael-o
Copy link
Member

@slawekjaranowski Thanks for the great collaboration. We have achieved a awesome result here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants