From a37c265371dc861fa478dd63deaa38a86415fe3b Mon Sep 17 00:00:00 2001
From: Sean Owen
Date: Thu, 7 Sep 2023 15:21:36 -0700
Subject: [PATCH] [SPARK-44732][XML][FOLLOWUP] Partial backport of spark-xml "Shortcut common type inference cases to fail fast"

### What changes were proposed in this pull request?

Partial back-port of https://github.com/databricks/spark-xml/commit/994e357f7666956b5d0e63627716b2c092d9abbd?diff=split from spark-xml

### Why are the changes needed?

Though no more development was intended on spark-xml, there was a non-trivial improvement to inference speed that I committed anyway to resolve a customer issue. Part of it can be 'backported' here to sync the code. I attached this as a follow-up to the main code port JIRA.

There is still, in general, no intent to commit more to spark-xml in the meantime unless it's significantly important.

### Does this PR introduce _any_ user-facing change?

No, this should only speed up schema inference without behavior change.

### How was this patch tested?

Tested in spark-xml, and will be tested by tests here too

Closes #42844 from srowen/SPARK-44732.2.

Authored-by: Sean Owen
Signed-off-by: Sean Owen
---
 .../apache/spark/sql/catalyst/xml/TypeCast.scala | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/TypeCast.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/TypeCast.scala
index a00f372da7f60..b065dd41f28f8 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/TypeCast.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/TypeCast.scala
@@ -155,6 +155,12 @@ private[sql] object TypeCast {
     } else {
       value
     }
+    // A little shortcut to avoid trying many formatters in the common case that
+    // the input isn't a double. All built-in formats will start with a digit or period.
+    if (signSafeValue.isEmpty ||
+      !(Character.isDigit(signSafeValue.head) || signSafeValue.head == '.')) {
+      return false
+    }
     // Rule out strings ending in D or F, as they will parse as double but should be disallowed
     if (value.nonEmpty && (value.last match {
       case 'd' | 'D' | 'f' | 'F' => true
@@ -171,6 +177,11 @@ private[sql] object TypeCast {
     } else {
       value
     }
+    // A little shortcut to avoid trying many formatters in the common case that
+    // the input isn't a number. All built-in formats will start with a digit.
+    if (signSafeValue.isEmpty || !Character.isDigit(signSafeValue.head)) {
+      return false
+    }
     (allCatch opt signSafeValue.toInt).isDefined
   }
 
@@ -180,6 +191,11 @@ private[sql] object TypeCast {
     } else {
       value
     }
+    // A little shortcut to avoid trying many formatters in the common case that
+    // the input isn't a number. All built-in formats will start with a digit.
+    if (signSafeValue.isEmpty || !Character.isDigit(signSafeValue.head)) {
+      return false
+    }
     (allCatch opt signSafeValue.toLong).isDefined
   }
 