Tags: TIBCOSoftware/snappydata
Removed "enterprise" from docs and comments also removed commented out portions for enterprise handling in build.gradle
Update version to 1.3.0-HF-1 for hotfix release:
- Update submodule links to get the fix for GEODE-8029.
- Other relevant fixes ported from Geode: GEODE-1969, GEODE-5302, GEODE-8667, GEODE-9881, GEODE-9854.
- Fixed some test failures.
1.3.0 RC2 release with the following major additions over RC1:

- Column batch auto-compaction: Deletes in column tables cause the relevant rows in the column storage batches to be marked as deleted. If there are a large number of deletions in a batch, this change automatically compacts the batch in storage; the previous code removed a batch only when all of its rows had been deleted. The compaction happens in the foreground, in the thread performing the delete. Likewise, if there are a large number of updates to a batch, it is compacted after a limit. Two spark-level properties (i.e. they have to be specified in conf/leads and conf/servers with -snappydata... or as a system property -D...) control the ratio at which compaction is triggered (see the configuration sketch at the end of this note):
  - snappydata.column.compactionRatio: the ratio of deletes after which the batch will be compacted (default is 0.1)
  - snappydata.column.updateCompactionRatio: the ratio of updates to a column after which the batch will be compacted (default is 0.2)
- Overflow of large results to disk on the executors: To handle OOM and related problems with queries whose multi-GB results get accumulated in the memory of leads/servers, the new implementation dumps large results to local disk files on the executors themselves. Instead of sending back the results, an executor sends back the Spark block IDs of those disk files to the lead and on to the server that is servicing the client. When the iterator on the server finds a block ID instead of a row, it fetches the disk block from the executor at that point and then removes the disk block. So large results are no longer stored in memory anywhere; only the memory for one disk block is required temporarily on the server (32MB by default). Note that this works only when connecting to the cluster using JDBC/ODBC (both the SnappyData and Spark hive servers work), or, when using SnappySession, by consuming results via SnappySession.sql().toLocalIterator(). Any intervening Dataset API usage will cause the regular Spark DataFrame to be used instead of the optimized DataFrame implementation that has this support. Two spark-level properties can alter the defaults (see the usage sketch at the end of this note):
  - spark.sql.maxMemoryResultSize: the maximum size of results in a partition on an executor that will be held in memory, after which they are dumped to disk; expressed as a string with units like 4m or 4mb for megabytes (default is 4mb). The maximum size of a single disk file is limited to 8X of this value (so 32mb by default).
  - spark.sql.resultPersistenceTimeout: the maximum time in seconds after completion of a query for which the disk blocks are retained (default is 21600, i.e. 6 hours). The timeout is required because the client can skip consuming the entire result set and stop the iteration midway, so some disk blocks would otherwise remain forever if not cleared out.
- Caching of meta-data of external tables: This fixes the slow-query problem for external tables having a large number of partitions with disk files. The underlying issue is that with a large number of partitions, each having a significant number of files, the total number of files is huge. Every query first tries to build the meta-data for all the existing parquet/orc/csv files (which may have changed since the last time due to new inserts, for example), and that takes all of the time reported for the queries. This change caches the meta-data in the current session, so the first query on a connection or SnappySession will be slow but subsequent ones will be fast. The cache is automatically invalidated if there are new inserts or alterations to the external table from the same cluster. This brings an important caveat: if the external parquet/orc/csv data is altered in any way from outside the cluster, then you must invalidate the cache in the cluster explicitly using the "REFRESH TABLE <name>" SQL or the SnappySession.catalog.refreshTable API, else queries will keep returning results as per the old meta-data (see the refresh sketch at the end of this note). This caveat applied even before this change, since the file/partition count is cached globally in the driver ExternalCatalog.
- Clear broadcast blocks eagerly: This change clears all the memory/disk blocks used in broadcast joins eagerly at the end of the query/DML operation. The previous Spark code clears the blocks on the executors only when a GC on the driver collects the broadcast object's weak reference, so the blocks can be held unnecessarily on the executors for quite a while, increasing GC pressure.

Apart from the above, a few Spark patches have been merged to address occasional problems reported due to them: SPARK-26352, SPARK-13747, SPARK-27065. Also, in accordance with SPARK-13747, all instances of Scala Await.ready/result calls in the code have been changed to use Spark's ThreadUtils.await methods instead.
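Configuration sketch for the column batch auto-compaction item above: per the note, these are cluster-side options passed in conf/leads and conf/servers with a -snappydata... prefix or as -D system properties. Assuming the usual one-host-per-line layout of those files (the host name and ratio values below are placeholders, not recommendations):

    # conf/servers (the same options can also be added to conf/leads)
    server-host1 -snappydata.column.compactionRatio=0.2 -snappydata.column.updateCompactionRatio=0.3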
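Usage sketch for the large-result overflow item above: with SnappySession the optimized path is kept only when results are streamed through sql().toLocalIterator() rather than consumed via the Dataset API. A minimal Scala sketch under that assumption; the master URL and the table name big_table are placeholders:

    import org.apache.spark.sql.{SnappySession, SparkSession}

    object LargeResultSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("large-result-sketch")
          .master("local[*]")  // placeholder; point this at the SnappyData cluster
          .getOrCreate()
        val snappy = new SnappySession(spark.sparkContext)

        // Stream the result: rows are pulled incrementally, so multi-GB results
        // are not accumulated in lead/server memory.
        val it = snappy.sql("SELECT * FROM big_table").toLocalIterator()
        var count = 0L
        while (it.hasNext) {
          it.next()   // process the Row here
          count += 1
        }
        println(s"consumed $count rows")
      }
    }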
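Refresh sketch for the external-table meta-data caching item above: when the parquet/orc/csv files are changed from outside the cluster, the session cache must be invalidated explicitly using either form mentioned in the note. The table name ext_sales and the master URL are placeholders:

    import org.apache.spark.sql.{SnappySession, SparkSession}

    object RefreshExternalTableSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("refresh-external-table-sketch")
          .master("local[*]")  // placeholder; point this at the SnappyData cluster
          .getOrCreate()
        val snappy = new SnappySession(spark.sparkContext)

        // Files under the external table's location were added/removed from outside
        // the cluster, so invalidate the cached meta-data before querying again.
        snappy.sql("REFRESH TABLE ext_sales")

        // Equivalent catalog API call mentioned in the note.
        snappy.catalog.refreshTable("ext_sales")

        // Subsequent queries rebuild and re-cache the file listing for this session.
        snappy.sql("SELECT count(*) FROM ext_sales").show()
      }
    }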