Tags: TIBCOSoftware/snappydata
Removed "enterprise" from docs and comments also removed commented out portions for enterprise handling in build.gradle
Update version to 1.3.0-HF-1 for hotfix release:
- Update submodule links to get the fix for GEODE-8029.
- Other relevant fixes ported from Geode: GEODE-1969, GEODE-5302, GEODE-8667, GEODE-9881, GEODE-9854.
- Fixed some test failures.
1.3.0 RC2 release with the following major additions over RC1:

- Column batch auto-compaction: Deletes in column tables cause the relevant rows in the column storage batches to be marked as deleted. If there are a large number of deletions in a batch, this change automatically compacts the batch in storage; the previous code removed a batch only when all of its rows had been deleted. The compaction happens in the foreground, in the thread performing the delete. Likewise, if there are a large number of updates to a batch, it is compacted after a limit. Two spark-level properties (i.e. they have to be specified in conf/leads and conf/servers with -snappydata... or as a system property -D...) control the ratio at which compaction is triggered (see the configuration sketch at the end of this note):
  - snappydata.column.compactionRatio: the ratio of deletes after which the batch will be compacted (default is 0.1)
  - snappydata.column.updateCompactionRatio: the ratio of updates to a column after which the batch will be compacted (default is 0.2)
- Overflow of large results to disk on the executors: To handle OOM and related problems with queries whose multi-GB results get accumulated in the memory of leads/servers, the new implementation dumps large results to local disk files on the executors themselves. Instead of sending back the results, an executor sends back the Spark block IDs of those disk files to the lead and on to the server that is servicing the client. When the iterator on the server finds a block ID instead of a row, it fetches the disk block from the executor at that point and then removes the disk block. So large results are no longer stored in memory anywhere; only the memory for one disk block is required temporarily on the server (32MB by default). Note that this works only when connecting to the cluster using JDBC/ODBC (both the SnappyData and Spark hive servers work), or, when using SnappySession, by consuming results via SnappySession.sql().toLocalIterator(). Any intervening Dataset API usage will cause the regular Spark DataFrame to be used instead of the optimized DataFrame implementation that has this support. Two spark-level properties can alter the defaults (see the usage sketch at the end of this note):
  - spark.sql.maxMemoryResultSize: the maximum size of results in a partition on an executor that will be held in memory, after which they are dumped to disk; expressed as a string with units like 4m or 4mb for megabytes (default is 4mb). The maximum size of a single disk file is limited to 8X of this value (so 32mb by default).
  - spark.sql.resultPersistenceTimeout: the maximum time in seconds after completion of a query for which the disk blocks are retained (default is 21600, i.e. 6 hours). The timeout is required because the client can skip consuming the entire result set and stop the iteration midway, so some disk blocks would otherwise remain forever if not cleared out.
- Caching of meta-data of external tables: This fixes the slow-query problem for external tables having a large number of partitions with disk files. The underlying issue is that with a large number of partitions, each having a significant number of files, the total number of files is huge. Every query first tries to build the meta-data for all the existing parquet/orc/csv files (which may have changed since the last time due to new inserts, for example), and that takes all of the time reported for the queries. This change caches the meta-data in the current session, so the first query on a connection or SnappySession will be slow but subsequent ones will be fast. The cache is automatically invalidated if there are new inserts or alterations to the external table from the same cluster. This brings an important caveat: if the external parquet/orc/csv data is altered in any way from outside the cluster, then you must invalidate the cache in the cluster explicitly using the "REFRESH TABLE <name>" SQL or the SnappySession.catalog.refreshTable API, else queries will keep returning results as per the old meta-data (see the refresh sketch at the end of this note). This caveat applied even before this change, since the file/partition count is cached globally in the driver ExternalCatalog.
- Clear broadcast blocks eagerly: This change clears all the memory/disk blocks used in broadcast joins eagerly at the end of the query/DML operation. The previous Spark code clears the blocks on the executors only when a GC on the driver collects the broadcast object's weak reference, so the blocks can be held unnecessarily on the executors for quite a while, increasing GC pressure.

Apart from the above, a few Spark patches have been merged to address occasional problems reported due to them: SPARK-26352, SPARK-13747, SPARK-27065. Also, in accordance with SPARK-13747, all instances of Scala Await.ready/result calls in the code have been changed to use Spark's ThreadUtils.await methods instead.
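Configuration sketch for the column batch auto-compaction item above: per the note, these are cluster-side options passed in conf/leads and conf/servers with a -snappydata... prefix or as -D system properties. Assuming the usual one-host-per-line layout of those files (the host name and ratio values below are placeholders, not recommendations):

    # conf/servers (the same options can also be added to conf/leads)
    server-host1 -snappydata.column.compactionRatio=0.2 -snappydata.column.updateCompactionRatio=0.3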
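Usage sketch for the large-result overflow item above: with SnappySession the optimized path is kept only when results are streamed through sql().toLocalIterator() rather than consumed via the Dataset API. A minimal Scala sketch under that assumption; the master URL and the table name big_table are placeholders:

    import org.apache.spark.sql.{SnappySession, SparkSession}

    object LargeResultSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("large-result-sketch")
          .master("local[*]")  // placeholder; point this at the SnappyData cluster
          .getOrCreate()
        val snappy = new SnappySession(spark.sparkContext)

        // Stream the result: rows are pulled incrementally, so multi-GB results
        // are not accumulated in lead/server memory.
        val it = snappy.sql("SELECT * FROM big_table").toLocalIterator()
        var count = 0L
        while (it.hasNext) {
          it.next()   // process the Row here
          count += 1
        }
        println(s"consumed $count rows")
      }
    }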
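Refresh sketch for the external-table meta-data caching item above: when the parquet/orc/csv files are changed from outside the cluster, the session cache must be invalidated explicitly using either form mentioned in the note. The table name ext_sales and the master URL are placeholders:

    import org.apache.spark.sql.{SnappySession, SparkSession}

    object RefreshExternalTableSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("refresh-external-table-sketch")
          .master("local[*]")  // placeholder; point this at the SnappyData cluster
          .getOrCreate()
        val snappy = new SnappySession(spark.sparkContext)

        // Files under the external table's location were added/removed from outside
        // the cluster, so invalidate the cached meta-data before querying again.
        snappy.sql("REFRESH TABLE ext_sales")

        // Equivalent catalog API call mentioned in the note.
        snappy.catalog.refreshTable("ext_sales")

        // Subsequent queries rebuild and re-cache the file listing for this session.
        snappy.sql("SELECT count(*) FROM ext_sales").show()
      }
    }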