sql/stats: inaccurate avg_size in statistics for column in single-column column family #137713
Open
Description
Single-column column families are encoded using legacy marshaling. When we collect statistics on these single-column column families, we estimate their size using Datum.Size() instead of the actual on-disk size, and this can be very inaccurate. Here's an example:
CREATE TYPE e AS ENUM ('hello', 'howdy', 'hi');
CREATE TABLE abc (a e NOT NULL, b e NOT NULL, c e NOT NULL, FAMILY (a, b), FAMILY (c));
INSERT INTO abc VALUES ('hi', 'hi', 'hi');
ANALYZE abc;
SHOW STATISTICS FOR TABLE abc;
The avg_size of both a
and b
is 3 bytes, but avg_size of c
is 48 bytes because DEnum.Size() includes a bunch of metadata that isn't written to disk.
demo@127.0.0.1:26257/demoapp/defaultdb> SHOW STATISTICS FOR TABLE abc;
statistics_name | column_names | created | row_count | distinct_count | null_count | avg_size | partial_predicate | histogram_id | full_histogram_id
------------------+--------------+-------------------------------+-----------+----------------+------------+----------+-------------------+---------------------+--------------------
NULL | {a} | 2024-12-18 17:25:46.427418+00 | 1 | 1 | 0 | 3 | NULL | 1030462984780972033 | 0
NULL | {b} | 2024-12-18 17:25:46.427418+00 | 1 | 1 | 0 | 3 | NULL | 1030462984784281601 | 0
NULL | {c} | 2024-12-18 17:25:46.427418+00 | 1 | 1 | 0 | 48 | NULL | 1030462984787001345 | 0
NULL | {rowid} | 2024-12-18 17:25:46.427418+00 | 1 | 1 | 0 | 9 | NULL | 1030462984777924609 | 0
(4 rows)
Jira issue: CRDB-45715
Metadata
Assignees
Labels
Table statistics (and their automatic refresh).Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.SQL Queries TeamUsed to mark GA and release blockers, technical advisories, and bugs for 22.1Used to mark GA and release blockers, technical advisories, and bugs for 22.2Used to mark GA and release blockers, technical advisories, and bugs for 23.1Used to mark GA and release blockers, technical advisories, and bugs for 23.2Used to mark GA and release blockers, technical advisories, and bugs for 24.1Used to mark GA and release blockers, technical advisories, and bugs for 24.2Used to mark GA and release blockers, technical advisories, and bugs for 24.3
Type
Projects
Status
Backlog