Skip to content

sql/stats: inaccurate avg_size in statistics for column in single-column column family #137713

Open
@michae2

Description

Single-column column families are encoded using legacy marshaling. When we collect statistics on these single-column column families, we estimate their size using Datum.Size() instead of the actual on-disk size, and this can be very inaccurate. Here's an example:

CREATE TYPE e AS ENUM ('hello', 'howdy', 'hi');
CREATE TABLE abc (a e NOT NULL, b e NOT NULL, c e NOT NULL, FAMILY (a, b), FAMILY (c));
INSERT INTO abc VALUES ('hi', 'hi', 'hi');
ANALYZE abc;
SHOW STATISTICS FOR TABLE abc;

The avg_size of both a and b is 3 bytes, but avg_size of c is 48 bytes because DEnum.Size() includes a bunch of metadata that isn't written to disk.

demo@127.0.0.1:26257/demoapp/defaultdb> SHOW STATISTICS FOR TABLE abc;
  statistics_name | column_names |            created            | row_count | distinct_count | null_count | avg_size | partial_predicate |    histogram_id     | full_histogram_id
------------------+--------------+-------------------------------+-----------+----------------+------------+----------+-------------------+---------------------+--------------------
  NULL            | {a}          | 2024-12-18 17:25:46.427418+00 |         1 |              1 |          0 |        3 | NULL              | 1030462984780972033 |                 0
  NULL            | {b}          | 2024-12-18 17:25:46.427418+00 |         1 |              1 |          0 |        3 | NULL              | 1030462984784281601 |                 0
  NULL            | {c}          | 2024-12-18 17:25:46.427418+00 |         1 |              1 |          0 |       48 | NULL              | 1030462984787001345 |                 0
  NULL            | {rowid}      | 2024-12-18 17:25:46.427418+00 |         1 |              1 |          0 |        9 | NULL              | 1030462984777924609 |                 0
(4 rows)

Jira issue: CRDB-45715

Metadata

Assignees

No one assigned

    Labels

    A-sql-table-statsTable statistics (and their automatic refresh).C-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.T-sql-queriesSQL Queries Teambranch-release-22.1Used to mark GA and release blockers, technical advisories, and bugs for 22.1branch-release-22.2Used to mark GA and release blockers, technical advisories, and bugs for 22.2branch-release-23.1Used to mark GA and release blockers, technical advisories, and bugs for 23.1branch-release-23.2Used to mark GA and release blockers, technical advisories, and bugs for 23.2branch-release-24.1Used to mark GA and release blockers, technical advisories, and bugs for 24.1branch-release-24.2Used to mark GA and release blockers, technical advisories, and bugs for 24.2branch-release-24.3Used to mark GA and release blockers, technical advisories, and bugs for 24.3

    Type

    No type

    Projects

    • Status

      Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions