Support running Pandas UDFs on GPUs in Python processes. #640

Merged
merged 72 commits into from
Sep 11, 2020
Changes from 1 commit
Commits
72 commits
4063857
Support Pandas UDF on GPU
firestarman Jul 21, 2020
94f3b22
Fix an error when running rapids.worker.
firestarman Jul 21, 2020
96e35aa
Pack python files
firestarman Jul 22, 2020
dccf977
Add API to init GPU context in python process
firestarman Jul 27, 2020
ef36d5e
Support limiting the number of python workers
firestarman Jul 31, 2020
93966b4
Support memory limitation for Python processes
firestarman Aug 4, 2020
3b2e527
Improve the memory computation for Python workers
firestarman Aug 5, 2020
1e4895a
Support setting max size of RMM pool
firestarman Aug 6, 2020
c877506
Support more types of Pandas UDF
firestarman Aug 11, 2020
d0d15c2
Use maxsize for max pool size when not specified.
firestarman Aug 18, 2020
61b3589
Support two more types of Pandas UDF
firestarman Aug 27, 2020
65e497d
Add tests for udfs and basic support for accelerated arrow exchange with
revans2 Aug 12, 2020
ec2cec6
Support Pandas UDF on GPU
firestarman Jul 21, 2020
d18ff4e
Fix an error when running rapids.worker.
firestarman Jul 21, 2020
616d1da
Pack python files
firestarman Jul 22, 2020
c5557ca
Add API to init GPU context in python process
firestarman Jul 27, 2020
2fcc2aa
Support limiting the number of python workers
firestarman Jul 31, 2020
31a9fb3
Support memory limitation for Python processes
firestarman Aug 4, 2020
f71f4de
Improve the memory computation for Python workers
firestarman Aug 5, 2020
d4791af
Support setting max size of RMM pool
firestarman Aug 6, 2020
4b88939
Support more types of Pandas UDF
firestarman Aug 11, 2020
d5d156e
Use maxsize for max pool size when not specified.
firestarman Aug 18, 2020
37dc856
Support two more types of Pandas UDF
firestarman Aug 27, 2020
29e39ea
Use the columnar version rule for Scalar Pandas UDF
firestarman Sep 1, 2020
1033f23
Updates the RapidsMeta of plans for Pandas UDF
firestarman Sep 2, 2020
5e28772
Remove the unnecessary env variable
firestarman Sep 2, 2020
3f94ac8
Correct some doc styles to pass mvn verification
firestarman Sep 3, 2020
6f24ca2
add udf test
shotai Sep 3, 2020
5ea3ab6
Merge branch 'pandas-udf' of https://github.com/firestarman/spark-rap…
shotai Sep 3, 2020
69d5b54
Support process pool for Python workers
firestarman Sep 3, 2020
9b25c2e
merge udftest
shotai Sep 3, 2020
c8ab681
add cudf test
shotai Sep 3, 2020
b81fe94
Add a config to disable/enable Pandas UDF on GPU.
firestarman Sep 4, 2020
908bb93
add more test case with cudf
shotai Sep 4, 2020
963c821
refactor udf test
shotai Sep 4, 2020
868ca3a
Python: Not init GPU if no cuda device specified
firestarman Sep 4, 2020
92c10a6
resolve conflict
shotai Sep 4, 2020
e05e4b6
resolve conflict
shotai Sep 4, 2020
69b1ec0
Update the config doc
firestarman Sep 7, 2020
1f56cd9
skip udf in premerge
shotai Sep 7, 2020
e400f5a
add pyarrow in docker
shotai Sep 7, 2020
6435eaa
disable udf test in premerge
shotai Sep 7, 2020
c33b41c
Merge pull request #4 from firestarman/pandas-test-mg
firestarman Sep 8, 2020
53959db
Merge branch 'branch-0.2' into pandas-udf-col
firestarman Sep 8, 2020
732505b
Move gpu init to `try...catch`
firestarman Sep 8, 2020
5a074ac
Remove numpy, it will include in pandas installation. Update readme.
shotai Sep 8, 2020
b80a201
update doc with pandas udf support
shotai Sep 8, 2020
7514ac1
update integration dockerfile
shotai Sep 8, 2020
db06504
Merge branch 'pandas-udf' of https://github.com/firestarman/spark-rap…
shotai Sep 8, 2020
7817e5e
Update getting-started-on-prem.md
shotai Sep 8, 2020
0f72025
Update getting-started-on-prem.md
shotai Sep 8, 2020
35a3008
Update getting-started-on-prem.md
shotai Sep 8, 2020
ec7848e
Update getting-started-on-prem.md
shotai Sep 8, 2020
be65f58
Add warning log when python worker reuse enabled
firestarman Sep 8, 2020
1757afa
Replace GpuSemaphore with PythonWorkerSemaphore
firestarman Sep 8, 2020
537e594
Remove the warning log for python worker reuse enabled
firestarman Sep 9, 2020
60cf951
remove udf marker, add comment, update jenkins script for udf_cudf test
shotai Sep 9, 2020
6c9e86b
update doc in pandas udf section
shotai Sep 9, 2020
58fab52
update dockerfile for integration test
shotai Sep 9, 2020
283b6a2
Merge branch 'pandas-udf' of https://github.com/firestarman/spark-rap…
shotai Sep 9, 2020
eee4d05
Update the name of conf for python gpu enabled.
firestarman Sep 10, 2020
8924ad8
add marker for cudf udf test
shotai Sep 10, 2020
7eba830
update comment in test start script
shotai Sep 10, 2020
f838ae0
remove old config
shotai Sep 10, 2020
803fcf4
Not init gpu memory when python on gpu is disabled
firestarman Sep 10, 2020
984082b
remove old config
shotai Sep 10, 2020
127ab08
Merge branch 'pandas-udf' of https://github.com/firestarman/spark-rap…
shotai Sep 10, 2020
6156298
import cudf lib normally
shotai Sep 10, 2020
beabf8b
update import cudf
shotai Sep 10, 2020
47ffc98
Check python module conf only when python gpu enabled
firestarman Sep 10, 2020
b1c9be5
update dynamic config for udf enable
shotai Sep 10, 2020
9860ee6
Merge branch 'pandas-udf' of https://github.com/firestarman/spark-rap…
shotai Sep 10, 2020
Correct some doc styles to pass mvn verification
Also include the Python configs in config doc generation, to update the file `configs.md`.
firestarman committed Sep 3, 2020
commit 3f94ac8b608e311c181892fc72756d894627037f
12 changes: 11 additions & 1 deletion docs/configs.md
Expand Up @@ -37,6 +37,11 @@ Name | Description | Default Value
<a name="memory.host.spillStorageSize"></a>spark.rapids.memory.host.spillStorageSize|Amount of off-heap host memory to use for buffering spilled GPU data before spilling to local disk|1073741824
<a name="memory.pinnedPool.size"></a>spark.rapids.memory.pinnedPool.size|The size of the pinned memory pool in bytes unless otherwise specified. Use 0 to disable the pool.|0
<a name="memory.uvm.enabled"></a>spark.rapids.memory.uvm.enabled|UVM or universal memory can allow main host memory to act essentially as swap for device(GPU) memory. This allows the GPU to process more data than fits in memory, but can result in slower processing. This is an experimental feature.|false
<a name="python.concurrentPythonWorkers"></a>spark.rapids.python.concurrentPythonWorkers|Set the number of Python worker processes that can execute concurrently per GPU. Python worker processes may temporarily block when the number of concurrent Python worker processes started by the same executor exceeds this amount. Allowing too many concurrent tasks on the same GPU may lead to GPU out of memory errors. >0 means enabled, while <=0 means unlimited|0
<a name="python.memory.gpu.allocFraction"></a>spark.rapids.python.memory.gpu.allocFraction|The fraction of total GPU memory that should be initially allocated for pooled memory for all the Python workers. It supposes to be less than (1 - $(spark.rapids.memory.gpu.allocFraction)), since the executor will share the GPU with its owning Python workers.|None
<a name="python.memory.gpu.maxAllocFraction"></a>spark.rapids.python.memory.gpu.maxAllocFraction|The fraction of total GPU memory that limits the maximum size of the RMM pool for all the Python workers. It supposes to be less than (1 - $(spark.rapids.memory.gpu.maxAllocFraction)), since the executor will share the GPU with its owning Python workers. when setting to 0 means no limit.|0.0
<a name="python.memory.gpu.pooling.enabled"></a>spark.rapids.python.memory.gpu.pooling.enabled|Should RMM in Python workers act as a pooling allocator for GPU memory, or should it just pass through to CUDA memory allocation directly.|None
<a name="python.memory.uvm.enabled"></a>spark.rapids.python.memory.uvm.enabled|Similar with `spark.rapids.python.memory.uvm.enabled`, but this conf is for python workers. This is an experimental feature.|None
<a name="shuffle.transport.enabled"></a>spark.rapids.shuffle.transport.enabled|When set to true, enable the Rapids Shuffle Transport for accelerated shuffle.|false
<a name="shuffle.transport.maxReceiveInflightBytes"></a>spark.rapids.shuffle.transport.maxReceiveInflightBytes|Maximum aggregate amount of bytes that be fetched at any given time from peers during shuffle|1073741824
<a name="shuffle.ucx.managementServerHost"></a>spark.rapids.shuffle.ucx.managementServerHost|The host to be used to start the management server|null
Expand Down Expand Up @@ -252,7 +257,12 @@ Name | Description | Default Value | Notes
<a name="sql.exec.CartesianProductExec"></a>spark.rapids.sql.exec.CartesianProductExec|Implementation of join using brute force|false|This is disabled by default because large joins can cause out of memory errors|
<a name="sql.exec.ShuffledHashJoinExec"></a>spark.rapids.sql.exec.ShuffledHashJoinExec|Implementation of join using hashed shuffled data|true|None|
<a name="sql.exec.SortMergeJoinExec"></a>spark.rapids.sql.exec.SortMergeJoinExec|Sort merge join, replacing with shuffled hash join|true|None|
<a name="sql.exec.ArrowEvalPythonExec"></a>spark.rapids.sql.exec.ArrowEvalPythonExec|Runs python UDFs. The python code does not actually run on the GPU, but the transfer of data between the python process and the java process is accelerated.|false|This is disabled by default because Performance is not ideal for UDFs that take a long time.|
<a name="sql.exec.AggregateInPandasExec"></a>spark.rapids.sql.exec.AggregateInPandasExec|The backend for Grouped Aggregation Pandas UDF, it runs on CPU itself now but supports running the Python UDFs code on GPU when calling cuDF APIs in the UDF|false|This is disabled by default because Performance is not ideal now|
<a name="sql.exec.ArrowEvalPythonExec"></a>spark.rapids.sql.exec.ArrowEvalPythonExec|The backend of the Scalar Pandas UDFs, it supports running the Python UDFs code on GPU when calling cuDF APIs in the UDF, also accelerates the data transfer between the Java process and Python process|false|This is disabled by default because Performance is not ideal for UDFs that take a long time|
<a name="sql.exec.FlatMapCoGroupsInPandasExec"></a>spark.rapids.sql.exec.FlatMapCoGroupsInPandasExec|The backend for CoGrouped Aggregation Pandas UDF, it runs on CPU itself now but supports running the Python UDFs code on GPU when calling cuDF APIs in the UDF|false|This is disabled by default because Performance is not ideal now|
<a name="sql.exec.FlatMapGroupsInPandasExec"></a>spark.rapids.sql.exec.FlatMapGroupsInPandasExec|The backend for Grouped Map Pandas UDF, it runs on CPU itself now but supports running the Python UDFs code on GPU when calling cuDF APIs in the UDF|false|This is disabled by default because Performance is not ideal now|
<a name="sql.exec.MapInPandasExec"></a>spark.rapids.sql.exec.MapInPandasExec|The backend for Map Pandas Iterator UDF, it runs on CPU itself now but supports running the Python UDFs code on GPU when calling cuDF APIs in the UDF|false|This is disabled by default because Performance is not ideal now|
<a name="sql.exec.WindowInPandasExec"></a>spark.rapids.sql.exec.WindowInPandasExec|The backend for Pandas UDF with window functions, it runs on CPU itself now but supports running the Python UDFs code on GPU when calling cuDF APIs in the UDF|false|This is disabled by default because Performance is not ideal now|
<a name="sql.exec.WindowExec"></a>spark.rapids.sql.exec.WindowExec|Window-operator backend|true|None|

### Scans
Expand Down
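The new `spark.rapids.python.*` entries above are the knobs an application sets to size and limit the GPU memory pool shared with Python workers. The sketch below shows how they might be combined in PySpark; the fractions, the worker count, and the executor-side `spark.rapids.memory.gpu.allocFraction` value are illustrative assumptions, not values taken from this PR.

```python
# Hypothetical sizing sketch: executor pool plus a separate pool for Python workers.
# Config keys come from the table above; every value here is an illustrative assumption.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pandas-udf-on-gpu-sketch")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Executor-side pool takes half the GPU ...
    .config("spark.rapids.memory.gpu.allocFraction", "0.5")
    # ... leaving headroom for the Python workers' pool, which must stay below
    # (1 - spark.rapids.memory.gpu.allocFraction).
    .config("spark.rapids.python.memory.gpu.allocFraction", "0.3")
    .config("spark.rapids.python.memory.gpu.maxAllocFraction", "0.4")
    .config("spark.rapids.python.memory.gpu.pooling.enabled", "true")
    # Allow at most two concurrent Python workers per GPU (>0 enables the limit).
    .config("spark.rapids.python.concurrentPythonWorkers", "2")
    .getOrCreate()
)
```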
3 changes: 1 addition & 2 deletions python/rapids/worker.py
Expand Up @@ -51,9 +51,8 @@ def initialize_gpu_mem():
base_t = rmm.mr.ManagedMemoryResource if uvm_enabled else rmm.mr.CudaMemoryResource
rmm.mr.set_current_device_resource(rmm.mr.PoolMemoryResource(base_t(), pool_size, pool_max_size))
elif uvm_enabled:
# Will this really be needed for Python ?
from cudf import rmm
rmm.mr.set_default_resource(rmm.mr.ManagedMemoryResource())
rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
else:
# Do nothing, whether to use RMM (default mode) or not depends on UDF definition.
pass
Expand Down
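For context on what `initialize_gpu_mem` is configuring, here is a self-contained sketch of the same RMM calls this hunk touches. The `pool_enabled`, `uvm_enabled`, `pool_size`, and `pool_max_size` parameters are hypothetical stand-ins for the values the worker reads from its Spark configuration.

```python
# Standalone sketch mirroring the RMM setup in python/rapids/worker.py.
import rmm

def init_rmm(pool_enabled, uvm_enabled, pool_size, pool_max_size):
    if pool_enabled:
        # Managed (UVM) or plain CUDA memory as the upstream resource...
        base_t = rmm.mr.ManagedMemoryResource if uvm_enabled else rmm.mr.CudaMemoryResource
        # ...wrapped in a pool so UDF allocations come from pre-reserved GPU memory.
        rmm.mr.set_current_device_resource(
            rmm.mr.PoolMemoryResource(base_t(), pool_size, pool_max_size))
    elif uvm_enabled:
        # UVM only, no pooling.
        rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
    # Otherwise leave RMM in its default mode; whether to use it is up to the UDF.
```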
Expand Up @@ -1672,8 +1672,9 @@ object GpuOverrides {
GpuLocalLimitExec(localLimitExec.limit, childPlans(0).convertIfNeeded())
}),
exec[ArrowEvalPythonExec](
"Runs python UDFs. The python code does not actually run on the GPU, but the " +
"transfer of data between the python process and the java process is accelerated.",
"The backend of the Scalar Pandas UDFs, it supports running the Python UDFs code on GPU" +
" when calling cuDF APIs in the UDF, also accelerates the data transfer between the" +
" Java process and Python process",
(e, conf, p, r) =>
new SparkPlanMeta[ArrowEvalPythonExec](e, conf, p, r) {
val udfs: Seq[BaseExprMeta[PythonUDF]] =
Expand Down Expand Up @@ -1772,23 +1773,28 @@ object GpuOverrides {
}
}),
exec[MapInPandasExec](
"The backend for Map Pandas Iterator UDF",
"The backend for Map Pandas Iterator UDF, it runs on CPU itself now but supports running" +
" the Python UDFs code on GPU when calling cuDF APIs in the UDF",
(mapPy, conf, p, r) => new GpuMapInPandasExecMeta(mapPy, conf, p, r))
.disabledByDefault("Performance is not ideal now"),
exec[FlatMapGroupsInPandasExec](
"The backend for Grouped Map Pandas UDF",
"The backend for Grouped Map Pandas UDF, it runs on CPU itself now but supports running" +
" the Python UDFs code on GPU when calling cuDF APIs in the UDF",
(flatPy, conf, p, r) => new GpuFlatMapGroupsInPandasExecMeta(flatPy, conf, p, r))
.disabledByDefault("Performance is not ideal now"),
exec[AggregateInPandasExec](
"The backend for Grouped Aggregation Pandas UDF",
"The backend for Grouped Aggregation Pandas UDF, it runs on CPU itself now but supports" +
" running the Python UDFs code on GPU when calling cuDF APIs in the UDF",
(aggPy, conf, p, r) => new GpuAggregateInPandasExecMeta(aggPy, conf, p, r))
.disabledByDefault("Performance is not ideal now"),
exec[FlatMapCoGroupsInPandasExec](
"The backend for CoGrouped Aggregation Pandas UDF",
"The backend for CoGrouped Aggregation Pandas UDF, it runs on CPU itself now but supports" +
" running the Python UDFs code on GPU when calling cuDF APIs in the UDF",
(flatCoPy, conf, p, r) => new GpuFlatMapCoGroupsInPandasExecMeta(flatCoPy, conf, p, r))
.disabledByDefault("Performance is not ideal now"),
exec[WindowInPandasExec](
"The backend for Pandas UDF with window functions",
"The backend for Pandas UDF with window functions, it runs on CPU itself now but supports" +
" running the Python UDFs code on GPU when calling cuDF APIs in the UDF",
(winPy, conf, p, r) => new GpuWindowInPandasExecMeta(winPy, conf, p, r))
.disabledByDefault("Performance is not ideal now")
).map(r => (r.getClassFor.asSubclass(classOf[SparkPlan]), r)).toMap
Expand Down
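The descriptions above all hinge on "running the Python UDF code on GPU when calling cuDF APIs in the UDF". As a rough user-side illustration of what such a Scalar Pandas UDF could look like (the function and column names are made up, not part of this PR):

```python
# Illustrative Scalar Pandas UDF that does its work through cuDF, so the Python
# side of the computation also runs on the GPU. Names are hypothetical.
import pandas as pd
import cudf
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def gpu_times_two(values: pd.Series) -> pd.Series:
    gpu_series = cudf.Series(values)   # move the batch to the GPU
    result = gpu_series * 2.0          # computed by cuDF on the GPU
    return result.to_pandas()          # hand a pandas Series back to Spark

# df.select(gpu_times_two("value")) is planned as ArrowEvalPythonExec, which this
# PR can replace with the GPU-accelerated version described above.
```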
Expand Up @@ -781,6 +781,8 @@ object RapidsConf {
}
}
def main(args: Array[String]): Unit = {
// Include the configs in PythonConfEntries
com.nvidia.spark.rapids.python.PythonConfEntries.init()
val out = new FileOutputStream(new File(args(0)))
Console.withOut(out) {
Console.withErr(out) {
Expand Down
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2020, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -59,4 +59,8 @@ object PythonConfEntries {
"python workers. This is an experimental feature.")
.booleanConf
.createOptional

// An empty function called by RapidsConf to initialize the config definitions above for
// doc generation
def init(): Unit = {}
}
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2020, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -22,19 +22,19 @@ import com.nvidia.spark.rapids.python.PythonConfEntries.CONCURRENT_PYTHON_WORKER
import org.apache.spark.{SparkEnv, TaskContext}
import org.apache.spark.internal.Logging

/**
* PythonWorkerSemaphore is used to limit the number of Python workers(processes) to be started
* by an executor.
*
* This PythonWorkerSemaphore will not initialize the GPU, different from GpuSemaphore. Since
* tasks calling the API `acquireIfNecessary` are supposed not to use the GPU directly, but
* delegate the permits to the Python workers respectively.
*
* Call `acquireIfNecessary` or `releaseIfNecessary` directly when needed, since the inner
* semaphore will be initialized implicitly, but need to call `shutdown` explicitly to release
* the inner semaphore when no longer needed.
*
*/
/*
* PythonWorkerSemaphore is used to limit the number of Python workers(processes) to be started
* by an executor.
*
* This PythonWorkerSemaphore will not initialize the GPU, different from GpuSemaphore. Since
* tasks calling the API `acquireIfNecessary` are supposed not to use the GPU directly, but
* delegate the permits to the Python workers respectively.
*
* Call `acquireIfNecessary` or `releaseIfNecessary` directly when needed, since the inner
* semaphore will be initialized implicitly, but need to call `shutdown` explicitly to release
* the inner semaphore when no longer needed.
*
*/
object PythonWorkerSemaphore extends Logging {

private lazy val workersPerGpu = new RapidsConf(SparkEnv.get.conf)
Expand All @@ -57,32 +57,32 @@ object PythonWorkerSemaphore extends Logging {
instance
}

/**
* Tasks must call this when they begin to start a Python worker who will use GPU.
* If the task has not already acquired the GPU semaphore then it is acquired,
* blocking if necessary.
* NOTE: A task completion listener will automatically be installed to ensure
* the semaphore is always released by the time the task completes.
*/
/*
* Tasks must call this when they begin to start a Python worker who will use GPU.
* If the task has not already acquired the GPU semaphore then it is acquired,
* blocking if necessary.
* NOTE: A task completion listener will automatically be installed to ensure
* the semaphore is always released by the time the task completes.
*/
def acquireIfNecessary(context: TaskContext): Unit = {
if (enabled && context != null) {
getInstance.acquireIfNecessary(context)
}
}

/**
* Tasks must call this when they are finished using the GPU.
*/
/*
* Tasks must call this when they are finished using the GPU.
*/
def releaseIfNecessary(context: TaskContext): Unit = {
if (enabled && context != null) {
getInstance.releaseIfNecessary(context)
}
}

/**
* Release the inner semaphore.
* NOTE: This does not wait for active tasks to release!
*/
/*
* Release the inner semaphore.
* NOTE: This does not wait for active tasks to release!
*/
def shutdown(): Unit = synchronized {
if (instance != null) {
instance.shutdown()
Expand Down
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2020, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -23,7 +23,7 @@ import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.python.PandasGroupUtils
import org.apache.spark.sql.vectorized.ColumnarBatch

/**
/*
* This is to expose the APIs of PandasGroupUtils to rapids Execs
*/
private[sql] object GpuPandasUtils {
Expand Down
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2020, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -20,7 +20,6 @@ import java.io.File

import com.nvidia.spark.rapids._
import com.nvidia.spark.rapids.python.PythonWorkerSemaphore

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.{SparkEnv, TaskContext}
Expand Down Expand Up @@ -61,7 +60,7 @@ class GpuAggregateInPandasExecMeta(
)
}

/**
/*
* This GpuAggregateInPandasExec aims at supporting running Pandas UDF code
* on GPU at Python side.
*
Expand All @@ -77,8 +76,7 @@ case class GpuAggregateInPandasExec(

override def supportsColumnar = false
override def doExecuteColumnar(): RDD[ColumnarBatch] = {
// TBD
super.doExecuteColumnar()
throw new IllegalStateException(s"Columnar execution is not supported by $this yet")
}

// Most code is copied from AggregateInPandasExec, except two GPU related calls
Expand Down
@@ -1,4 +1,6 @@
/*
* Copyright (c) 2020, NVIDIA CORPORATION.
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
Expand All @@ -22,26 +24,27 @@ import java.net.Socket
import java.util.concurrent.atomic.AtomicBoolean

import ai.rapids.cudf._
import com.nvidia.spark.rapids.GpuMetricNames._
import com.nvidia.spark.rapids._
import com.nvidia.spark.rapids.GpuMetricNames._
import com.nvidia.spark.rapids.python.PythonWorkerSemaphore
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.{SparkEnv, TaskContext}
import org.apache.spark.api.python._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.util.toPrettySQL
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
import org.apache.spark.sql.execution.python.PythonUDFRunner
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{DataType, StructField, StructType}
import org.apache.spark.sql.util.ArrowUtils
import org.apache.spark.sql.vectorized.ColumnarBatch
import org.apache.spark.util.Utils
import org.apache.spark.{SparkEnv, TaskContext}

import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

class RebatchingIterator(
wrapped: Iterator[ColumnarBatch],
Expand Down Expand Up @@ -131,7 +134,7 @@ class BatchQueue extends AutoCloseable {
}
}

/**
/*
* Helper functions for [[GpuPythonUDF]]
*/
object GpuPythonUDF {
Expand All @@ -155,7 +158,7 @@ object GpuPythonUDF {
def isWindowPandasUDF(e: Expression): Boolean = isGroupedAggPandasUDF(e)
}

/**
/*
* A serialized version of a Python lambda function. This is a special expression, which needs a
* dedicated physical operator to execute it, and thus can't be pushed down to data sources.
*/
Expand Down Expand Up @@ -185,7 +188,7 @@ case class GpuPythonUDF(
}
}

/**
/*
* A trait that can be mixed-in with `BasePythonRunner`. It implements the logic from
* Python (Arrow) to GPU/JVM (ColumnarBatch).
*/
Expand Down Expand Up @@ -256,7 +259,7 @@ trait GpuPythonArrowOutput extends Arm { self: BasePythonRunner[_, ColumnarBatch
}


/**
/*
* Similar to `PythonUDFRunner`, but exchange data with Python worker via Arrow stream.
*/
class GpuArrowPythonRunner(
Expand Down Expand Up @@ -366,7 +369,7 @@ class StreamToBufferProvider(inputStream: DataInputStream) extends HostBufferPro
}
}

/**
/*
* A physical plan that evaluates a [[GpuPythonUDF]]. The transformation of the data to arrow
* happens on the GPU (practically a noop), But execution of the UDFs are on the CPU or GPU.
*/
Expand Down
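The `GpuArrowPythonRunner` and `GpuPythonArrowOutput` pieces in this file exchange batches with the Python worker over an Arrow IPC stream. As a rough, self-contained illustration of that wire format only (not the plugin's actual socket protocol), pyarrow can round-trip a batch like this, with an in-memory buffer standing in for the worker socket:

```python
# Rough illustration of the Arrow IPC stream format used between the JVM and
# the Python worker; io.BytesIO stands in for the real socket streams.
import io
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

sink = io.BytesIO()
with pa.ipc.new_stream(sink, batch.schema) as writer:    # "JVM" side: serialize
    writer.write_batch(batch)

reader = pa.ipc.open_stream(io.BytesIO(sink.getvalue()))  # worker side: deserialize
for received in reader:
    print(received.num_rows, received.schema.names)
```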
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2020, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -58,7 +58,7 @@ class GpuFlatMapCoGroupsInPandasExecMeta(
)
}

/**
/*
*
* This GpuFlatMapCoGroupsInPandasExec aims at supporting running Pandas functional code
* on GPU at Python side.
Expand All @@ -77,8 +77,7 @@ case class GpuFlatMapCoGroupsInPandasExec(

override def supportsColumnar = false
override def doExecuteColumnar(): RDD[ColumnarBatch] = {
// TBD
super.doExecuteColumnar()
throw new IllegalStateException(s"Columnar execution is not supported by $this yet")
}

// Most code is copied from FlatMapCoGroupsInPandasExec, except two GPU related calls
Expand Down
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2020, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -57,7 +57,7 @@ class GpuFlatMapGroupsInPandasExecMeta(
)
}

/**
/*
*
* This GpuFlatMapGroupsInPandasExec aims at supporting running Pandas functional code
* on GPU at Python side.
Expand All @@ -74,8 +74,7 @@ case class GpuFlatMapGroupsInPandasExec(

override def supportsColumnar = false
override def doExecuteColumnar(): RDD[ColumnarBatch] = {
// TBD
super.doExecuteColumnar()
throw new IllegalStateException(s"Columnar execution is not supported by $this yet")
}

// Most code is copied from FlatMapGroupsInPandasExec, except two GPU related calls
Expand Down