Skip to content

Commit

Permalink
commit-graph: implement generation data chunk
Browse files Browse the repository at this point in the history
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

To minimize the space required to store corrrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instea
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:

We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.

If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.

We test the overflow-related code with the following repo history:

           F - N - U
          /         \
U - N - U            N
         \          /
	  N - F - N

Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.

The largest offset observed is 2 ^ 31, just large enough to overflow.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
  • Loading branch information
abhishekkumar2718 authored and gitster committed Jan 19, 2021
1 parent c1a0911 commit e8b6300
Show file tree
Hide file tree
Showing 10 changed files with 200 additions and 32 deletions.
114 changes: 103 additions & 11 deletions commit-graph.c
Original file line number Diff line number Diff line change
Expand Up @@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
#define MAX_NUM_CHUNKS 7
#define MAX_NUM_CHUNKS 9

#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)

Expand All @@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)

#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)

/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)

Expand Down Expand Up @@ -394,6 +398,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
graph->chunk_commit_data = data + chunk_offset;
break;

case GRAPH_CHUNKID_GENERATION_DATA:
if (graph->chunk_generation_data)
chunk_repeated = 1;
else
graph->chunk_generation_data = data + chunk_offset;
break;

case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
if (graph->chunk_generation_data_overflow)
chunk_repeated = 1;
else
graph->chunk_generation_data_overflow = data + chunk_offset;
break;

case GRAPH_CHUNKID_EXTRAEDGES:
if (graph->chunk_extra_edges)
chunk_repeated = 1;
Expand Down Expand Up @@ -754,8 +772,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
{
const unsigned char *commit_data;
struct commit_graph_data *graph_data;
uint32_t lex_index;
uint64_t date_high, date_low;
uint32_t lex_index, offset_pos;
uint64_t date_high, date_low, offset;

while (pos < g->num_commits_in_base)
g = g->base_graph;
Expand All @@ -773,7 +791,19 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);

graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
if (g->chunk_generation_data) {
offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);

if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
if (!g->chunk_generation_data_overflow)
die(_("commit-graph requires overflow generation data but has none"));

offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
} else
graph_data->generation = item->date + offset;
} else
graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;

if (g->topo_levels)
*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
Expand Down Expand Up @@ -945,6 +975,7 @@ struct write_commit_graph_context {
struct oid_array oids;
struct packed_commit_list commits;
int num_extra_edges;
int num_generation_data_overflows;
unsigned long approx_nr_objects;
struct progress *progress;
int progress_done;
Expand All @@ -963,7 +994,8 @@ struct write_commit_graph_context {
report_progress:1,
split:1,
changed_paths:1,
order_by_pack:1;
order_by_pack:1,
write_generation_data:1;

struct topo_level_slab *topo_levels;
const struct commit_graph_opts *opts;
Expand Down Expand Up @@ -1123,6 +1155,45 @@ static int write_graph_chunk_data(struct hashfile *f,
return 0;
}

static int write_graph_chunk_generation_data(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
int i, num_generation_data_overflows = 0;

for (i = 0; i < ctx->commits.nr; i++) {
struct commit *c = ctx->commits.list[i];
timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
display_progress(ctx->progress, ++ctx->progress_cnt);

if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
num_generation_data_overflows++;
}

hashwrite_be32(f, offset);
}

return 0;
}

static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
int i;
for (i = 0; i < ctx->commits.nr; i++) {
struct commit *c = ctx->commits.list[i];
timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
display_progress(ctx->progress, ++ctx->progress_cnt);

if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
hashwrite_be32(f, offset >> 32);
hashwrite_be32(f, (uint32_t) offset);
}
}

return 0;
}

static int write_graph_chunk_extra_edges(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
Expand Down Expand Up @@ -1386,6 +1457,9 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (current->date && current->date > max_corrected_commit_date)
max_corrected_commit_date = current->date - 1;
commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;

if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
ctx->num_generation_data_overflows++;
}
}
}
Expand Down Expand Up @@ -1719,6 +1793,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
chunks[2].id = GRAPH_CHUNKID_DATA;
chunks[2].size = (hashsz + 16) * ctx->commits.nr;
chunks[2].write_fn = write_graph_chunk_data;

if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
ctx->write_generation_data = 0;
if (ctx->write_generation_data) {
chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
num_chunks++;
}
if (ctx->num_generation_data_overflows) {
chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
num_chunks++;
}
if (ctx->num_extra_edges) {
chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
chunks[num_chunks].size = 4 * ctx->num_extra_edges;
Expand Down Expand Up @@ -2139,6 +2228,8 @@ int write_commit_graph(struct object_directory *odb,
ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
ctx->opts = opts;
ctx->total_bloom_filter_data_size = 0;
ctx->write_generation_data = 1;
ctx->num_generation_data_overflows = 0;

bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
bloom_settings.bits_per_entry);
Expand Down Expand Up @@ -2445,16 +2536,17 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
continue;

/*
* If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
* our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
* extra logic in the following condition.
* If we are using topological level and one of our parents has
* generation GENERATION_NUMBER_V1_MAX, then our generation is
* also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
* in the following condition.
*/
if (max_generation == GENERATION_NUMBER_V1_MAX)
if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
max_generation--;

generation = commit_graph_generation(graph_commit);
if (generation != max_generation + 1)
graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
if (generation < max_generation + 1)
graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
oid_to_hex(&cur_oid),
generation,
max_generation + 1);
Expand Down
3 changes: 3 additions & 0 deletions commit-graph.h
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
#include "oidset.h"

#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"

Expand Down Expand Up @@ -68,6 +69,8 @@ struct commit_graph {
const uint32_t *chunk_oid_fanout;
const unsigned char *chunk_oid_lookup;
const unsigned char *chunk_commit_data;
const unsigned char *chunk_generation_data;
const unsigned char *chunk_generation_data_overflow;
const unsigned char *chunk_extra_edges;
const unsigned char *chunk_base_graphs;
const unsigned char *chunk_bloom_indexes;
Expand Down
1 change: 1 addition & 0 deletions commit.h
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0
#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)

struct commit_list {
struct commit *item;
Expand Down
3 changes: 3 additions & 0 deletions t/README
Original file line number Diff line number Diff line change
Expand Up @@ -393,6 +393,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
be written after every 'git commit' command, and overrides the
'core.commitGraph' setting to true.

GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
commit-graph to be written without generation data chunk.

GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
commit-graph write to compute and write changed path Bloom filters for
every 'git commit-graph write', as if the `--changed-paths` option was
Expand Down
4 changes: 4 additions & 0 deletions t/helper/test-read-graph.c
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,10 @@ int cmd__read_graph(int argc, const char **argv)
printf(" oid_lookup");
if (graph->chunk_commit_data)
printf(" commit_metadata");
if (graph->chunk_generation_data)
printf(" generation_data");
if (graph->chunk_generation_data_overflow)
printf(" generation_data_overflow");
if (graph->chunk_extra_edges)
printf(" extra_edges");
if (graph->chunk_bloom_indexes)
Expand Down
4 changes: 2 additions & 2 deletions t/t4216-log-bloom.sh
Original file line number Diff line number Diff line change
Expand Up @@ -40,11 +40,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
'

graph_read_expect () {
NUM_CHUNKS=5
NUM_CHUNKS=6
cat >expect <<- EOF
header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
num_commits: $1
chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
EOF
test-tool read-graph >actual &&
test_cmp expect actual
Expand Down
Loading

0 comments on commit e8b6300

Please sign in to comment.