Move documentation images into a separate folder
akorotkov committed Aug 26, 2022
1 parent 048a90c commit 6a41bc5
Showing 50 changed files with 49 additions and 49 deletions.
36 changes: 18 additions & 18 deletions doc/arch.md
@@ -15,7 +15,7 @@ OrioleDB stores data in index-organized tables. Consider the example of the `ve

The data structure behind the `vegetables` table is given below. Both the primary and secondary indexes are variations of a B+-tree. Leaf tuples are marked blue, non-leaf tuples green, and high keys yellow. The primary index's non-leaf tuples and high keys consist of the primary key columns; its leaf tuples are the table rows themselves. The non-leaf tuples, leaf tuples, and high keys of the secondary index consist of the secondary index columns together with the primary key columns.

![Indexes of vegetables table](vegetables_indexes.svg)
![Indexes of vegetables table](images/vegetables_indexes.svg)

The primary index and secondary index are not tied together; they are connected only by the logical values stored in table rows. OrioleDB automatically creates a primary index on the virtual `ctid` column when no primary index is given.

@@ -26,15 +26,15 @@ OrioleDB uses a so-called `dual pointers` scheme to avoid buffer mapping and cor

The diagram below shows the simplified structure of the OrioleDB B-tree. The diagram does not detail page contents; main memory pages are numbered the same as their storage prototypes. Arrows depict downlinks.

![Dual pointers scheme](dual_pointers.svg)
![Dual pointers scheme](images/dual_pointers.svg)

A main memory page can refer to both main memory and storage pages, whereas storage pages can refer only to other storage pages. Therefore, even though main memory pages `1`, `2` and `3` are numbered the same as storage pages `1`, `2` and `3`, their contents differ in this respect.

In order to implement this scheme, we have to sacrifice rightlinks: it would be too complex (and likely slow) to toggle both downlinks and rightlinks between in-memory and storage pointers.

Technically, OrioleDB B-trees still contain rightlinks, but they are used only temporarily during page splits. A rightlink exists only between the split of a page and the insertion of the downlink to the new page into the parent. Therefore, if a completed split happens concurrently with locating a tree page, one must retry from the parent (see the find_page() function). Stepping right and left through tree pages becomes more complex, too: instead of following rightlinks (and leftlinks), one has to find siblings via the parent (see find_right_page() and find_left_page()). However, this complexity is more than justified by better vertical scalability.
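
To make the retry-from-parent rule concrete, here is a minimal sketch in C. It is not OrioleDB's actual find_page(): the types are invented, locking is elided, and a parent pointer stands in for the descent stack a real implementation would keep.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Page Page;
struct Page
{
    Page *parent;       /* stand-in for the descent stack */
    bool  is_leaf;
    int   hikey;        /* high key of the page */
    int   nkeys;
    int   keys[16];     /* separator keys */
    Page *downlinks[17];/* children; unused on a leaf */
};

/*
 * Descend towards `key`.  Whenever the key turns out to be beyond the
 * page's high key, a concurrent split has completed and moved the key to
 * a sibling we can no longer reach (no rightlink), so retry from the
 * parent.
 */
static Page *
find_page(Page *root, int key)
{
    Page *page = root;

    for (;;)
    {
        if (page->parent != NULL && key > page->hikey)
        {
            page = page->parent;    /* completed split: retry */
            continue;
        }
        if (page->is_leaf)
            return page;

        int i = 0;

        while (i < page->nkeys && key > page->keys[i])
            i++;
        page = page->downlinks[i];
    }
}
```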

See [concurrency algorithms in OrioleDB B-tree](concurrency.md) for details.
See [concurrency algorithms in OrioleDB B-tree](images/concurrency.md) for details.

Page structure
--------------
@@ -51,7 +51,7 @@ See `src/btree/page_state.c` for details.

The tuples on the page are split into chunks, and there is an area of high keys where each chunk has an associated high key. The high key of the last chunk is simultaneously the high key of the page. Thanks to this, a reader that needs a particular tuple does not have to copy the whole page: it is enough to copy the high keys area, find the appropriate chunk, and copy only that chunk. The `PartialPageState` structure is responsible for tracking partially read pages.

![Page structure](page_structure.svg)
![Page structure](images/page_structure.svg)
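
As a rough illustration of this access pattern, the sketch below copies only the high keys area and one chunk. The types are invented for the example and are not the real `PartialPageState`; sizes, key types, and concurrency control are all simplified away.

```c
#include <string.h>

#define CHUNKS_PER_PAGE 8
#define CHUNK_SIZE      512

typedef struct
{
    int  hikeys[CHUNKS_PER_PAGE];             /* high key of each chunk */
    char chunks[CHUNKS_PER_PAGE][CHUNK_SIZE]; /* tuple data */
} SharedPage;

typedef struct
{
    int  hikeys[CHUNKS_PER_PAGE]; /* always copied: it is small */
    int  chunk_no;                /* which chunk we copied */
    char chunk[CHUNK_SIZE];       /* only the chunk we actually need */
} PartialPage;

/* Copy the high keys area, locate the chunk covering `key`, copy it. */
static void
partial_read(const SharedPage *src, PartialPage *dst, int key)
{
    memcpy(dst->hikeys, src->hikeys, sizeof(dst->hikeys));

    int i = 0;

    while (i < CHUNKS_PER_PAGE - 1 && key > dst->hikeys[i])
        i++;

    dst->chunk_no = i;
    memcpy(dst->chunk, src->chunks[i], CHUNK_SIZE);
}
```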

In the future, we plan to get rid of copying in the majority of page access patterns and to implement vectorization for faster search within the page.

Expand All @@ -62,50 +62,50 @@ OrioleDB utilized copy-on-write checkpoints and row-level WAL. The example belo

For example, page number `7` was modified. It was marked as `7*`.

![Copy-on-write checkpoint 1](cow_1.svg)
![Copy-on-write checkpoint 1](images/cow_1.svg)

The checkpoint has written `7*` to storage, placing it in free space according to the copy-on-write principle. When the checkpointer writes a non-leaf page, it replaces in-memory downlinks with storage ones. Therefore, page `3` is also considered modified, because the storage page must reference the new `7*`. So, page `3*` is also written to free storage space, and likewise `1*`.

![Copy-on-write checkpoint 2](cow_2.svg)
![Copy-on-write checkpoint 2](images/cow_2.svg)
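
A minimal sketch of the copy-on-write rule follows, assuming an invented in-memory page layout and a stub allocator in place of the real free space management. It shows why writing `7*` forces `3*` and `1*` to be written as well: once a child's storage location changes, the parent's storage image must change too.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t BlockNumber;

typedef struct MemPage MemPage;
struct MemPage
{
    bool        dirty;      /* modified since the last checkpoint */
    BlockNumber on_disk;    /* current storage location */
    int         nchildren;  /* 0 for a leaf */
    MemPage    *children[16];
};

/* Stub for free space management: pretend to append to free space. */
static BlockNumber next_free_block = 100;

static BlockNumber
write_to_free_space(MemPage *page)
{
    (void) page;            /* a real implementation writes the image */
    return next_free_block++;
}

/*
 * Return the storage downlink for `page`, writing the page copy-on-write
 * if it or any descendant changed.  A dirty leaf (like `7*`) forces every
 * ancestor on its path (`3*`, `1*`) to be rewritten, because their
 * storage images must hold the children's new locations.
 */
static BlockNumber
checkpoint_page(MemPage *page)
{
    bool must_write = page->dirty;

    for (int i = 0; i < page->nchildren; i++)
    {
        BlockNumber old_loc = page->children[i]->on_disk;

        if (checkpoint_page(page->children[i]) != old_loc)
            must_write = true;  /* child moved: our downlink changes */
    }

    if (must_write)
    {
        page->on_disk = write_to_free_space(page);
        page->dirty = false;
    }
    return page->on_disk;
}
```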

Once the checkpoint is completed, old storage page images `1`, `3`, and `7` are marked as free space.

![Copy-on-write checkpoint 3](cow_3.svg)
![Copy-on-write checkpoint 3](images/cow_3.svg)

Therefore, a consistent tree image exists in storage at every moment.

OrioleDB supports fuzzy checkpointing; that is, we allow tree modifications concurrent with checkpointing. That is essential because too-fast or too-frequent checkpoints could cause a write flood.

Consider the following example. The tree contains pages `1` – `7`. Pages `1` – `6` are present in both main memory and storage (checkpoint 1), while page `7` is present in storage only. Page `4` was modified (`4*`).

![Concurrent checkpoint 1](checkpoint_concurrent_1.svg)
![Concurrent checkpoint 1](images/checkpoint_concurrent_1.svg)

The checkpointer started traversing the tree from left to right and passed the subtree of pages `2`, `4`, and `5`. Page images `4*` and `2*` were written to checkpoint 2.

![Concurrent checkpoint 2](checkpoint_concurrent_2.svg)
![Concurrent checkpoint 2](images/checkpoint_concurrent_2.svg)

Concurrently, page `5` was modified (`5*`), and the background writer wrote the page image `5*`. Page `5` belongs to the part of the tree already processed by the checkpointer; that is why we cannot write it to checkpoint 2: doing so could break the checkpoint's consistency.

Page number `7` was also concurrently modified but was not written by a background writer.

![Concurrent checkpoint 3](checkpoint_concurrent_3.svg)
![Concurrent checkpoint 3](images/checkpoint_concurrent_3.svg)

Then, the checkpointer finished writing page images `7*`, `3*`, and `1*` to checkpoint 2.

![Concurrent checkpoint 4](checkpoint_concurrent_4.svg)
![Concurrent checkpoint 4](images/checkpoint_concurrent_4.svg)

In general, checkpointing of non-leaf pages is trickier than described above. While the checkpointer is writing the children of a non-leaf page, concurrent splits and merges can happen. In such cases, we have to reconstruct the non-leaf page based on the state of its children as we met them. Therefore, we might write to storage a non-leaf page image that never existed in main memory. Furthermore, we could even write multiple storage pages corresponding to a single main memory page (imagine merges happening above the checkpoint boundary while splits happen below it). That is OK, because the result reflects how the checkpointer wrote the children.

At any moment in time, there can be multiple checkpoints using different but overlapping sets of blocks. Therefore, [free space management](fsm.md) becomes a non-trivial task.
At any moment in time, there can be multiple checkpoints using different but overlapping sets of blocks. Therefore, [free space management](images/fsm.md) becomes a non-trivial task.

See [the detailed description of the checkpointing algorithm](checkpoint.md).
See [the detailed description of the checkpointing algorithm](images/checkpoint.md).

Undo log
--------

OrioleDB implements transactions and MVCC using an undo log. Row-level undo records comprise chains of row versions. In particular, tuple headers are connected into lists, where the head is located on the data page and the rest of the elements are located in the undo log. See the diagram below.

![Row level undo](row_level_undo.svg)
![Row level undo](images/row_level_undo.svg)
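
The sketch below models the version chain under simplified assumptions: invented types, a bare transaction-id comparison instead of real snapshot and commit-status checks, and no locking. It only illustrates how a reader walks from the header on the data page into the undo log.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct TupleHeader TupleHeader;
struct TupleHeader
{
    TransactionId xid;   /* transaction that created this version */
    TupleHeader  *prev;  /* older version in the undo log, or NULL */
    void         *body;  /* tuple body; NULL for delete/lock records */
};

/*
 * Hypothetical visibility check: here a version is "visible" if its
 * creating transaction is below the snapshot's horizon.  A real check
 * also consults commit status, the snapshot's xid list, etc.
 */
static bool
is_visible(const TupleHeader *version, TransactionId snapshot_xmax)
{
    return version->xid < snapshot_xmax;
}

/* Start at the header on the data page; step into the undo log as needed. */
static const TupleHeader *
find_visible_version(const TupleHeader *head, TransactionId snapshot_xmax)
{
    for (const TupleHeader *v = head; v != NULL; v = v->prev)
        if (is_visible(v, snapshot_xmax))
            return v;
    return NULL;    /* the row did not exist for this snapshot */
}
```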

Some undo records include both the tuple header and the tuple body (update records), while others do not alter the tuple body and contain just the tuple header (delete and row-level lock records).

@@ -121,15 +121,15 @@ OrioleDB also supports block-level undo records. The block-level undo records a

The diagram below gives an example of a `compact` block-level undo record. Here the data page contains tuples `t1`, `t2`, and `t5`, while the page image in the undo log contains tuples `t1`, `t2`, `t3`, and `t4`. That means tuples `t3` and `t4` were deleted, but we still lacked space for the new tuple `t5`. In order to fit it, we performed a `compaction`: we issued a page-level undo record and erased tuples `t3` and `t4` to make room for `t5`.

![Block level undo](block_level_undo.svg)
![Block level undo](images/block_level_undo.svg)

OrioleDB has three types of block-level undo records (see the sketch after the list):

1. Compact undo record: one data page references one undo page image,
2. Split undo record: two data pages reference one undo page image,
3. Merge undo record: one data page references two undo page images.
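
A minimal way to model these three kinds in code (illustrative names, not OrioleDB's):

```c
/* Each kind differs in how many data pages reference how many images. */
typedef enum
{
    BLOCK_UNDO_COMPACT, /* 1 data page  -> 1 undo page image  */
    BLOCK_UNDO_SPLIT,   /* 2 data pages -> 1 undo page image  */
    BLOCK_UNDO_MERGE    /* 1 data page  -> 2 undo page images */
} BlockUndoKind;

typedef struct
{
    BlockUndoKind kind;
    int           n_data_pages;  /* data pages referencing this record */
    int           n_undo_images; /* page images kept in the undo log */
} BlockUndoRecord;
```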

OrioleDB uses both circular buffers and block buffers for accessing the undo log. See [undo log storage](buffering.md) for details.
OrioleDB uses both circular buffers and block buffers for accessing the undo log. See [undo log storage](images/buffering.md) for details.

Row-level WAL
-------------
@@ -148,11 +148,11 @@ Since OrioleDB implements fuzzy checkpointing, we require idempotency property h

OrioleDB implements parallel application of WAL records. It launches `orioledb.recovery_pool_size` workers. Each worker is responsible for its own set of primary key values (according to their hash values). The startup process distributes row-level WAL records to the queues connected to the workers.

![Recovery workers](recovery_wokers.svg)
![Recovery workers](images/recovery_wokers.svg)
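
A sketch of this distribution rule is given below. The FNV-1a hash is an arbitrary stand-in for the real hash of the primary key value; the point is only that the same primary key always maps to the same worker, so all WAL records for one row are replayed in order.

```c
#include <stddef.h>
#include <stdint.h>

#define RECOVERY_POOL_SIZE 4 /* stands in for orioledb.recovery_pool_size */

/* FNV-1a: an illustrative stand-in for the real primary-key hash. */
static uint64_t
hash_primary_key(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint64_t h = 1469598103934665603ULL;

    for (size_t i = 0; i < len; i++)
    {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* The startup process routes each row-level record to one worker queue. */
static int
assign_worker(const void *pk, size_t pk_len)
{
    return (int) (hash_primary_key(pk, pk_len) % RECOVERY_POOL_SIZE);
}
```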

Queues might be processed at different paces. To avoid MVCC anomalies, we consider a transaction committed and visible to readers only once all the workers have completed all the work associated with that transaction.

See [the details about OrioleDB's recovery](recovery.md).
See [the details about OrioleDB's recovery](images/recovery.md).

System catalog
--------------
10 changes: 5 additions & 5 deletions doc/buffering.md
@@ -11,20 +11,20 @@ OrioleDB keeps enough undo log to fulfill the following requirements:

The picture below illustrates the filling of the undo circular buffer. The retain location is the minimal position where we must keep undo records to fulfill the requirements. The arrows between the undo records illustrate the insertion order, not the actual links.

![Undo circular buffer](undo_buffer_1.svg)
![Undo circular buffer](images/undo_buffer_1.svg)

Once we reach the end of the circular buffer, we start again from the beginning, provided that the retain location allows us to reuse that space after the wraparound. As long as we do not need to retain too many undo records, we can store all of them in the circular buffer.

![Undo circular buffer wraparound](undo_buffer_2.svg)
![Undo circular buffer wraparound](images/undo_buffer_2.svg)

When we need to add a new undo record but the corresponding space is still occupied by retained undo records, we have to write some undo records to undo files. The picture below shows the undo log split between the circular buffer and an undo file. The records before the written location are kept in undo files, and the records after the written location are kept in the circular buffer. If we no longer need to retain much of the undo log (for instance, after a long-running transaction finishes), we may switch back to storing the whole undo log in the circular buffer.

![Undo log split](undo_buffer_3.svg)
![Undo log split](images/undo_buffer_3.svg)

The two pictures below explain how we write a chunk of undo records to undo files. First, we set the "write in-progress" location. This means that undo records within the range [written location; write in-progress location) are about to be written to undo files. During this period, undo records from this range can still be read from the circular buffer but can't be modified (in some situations, we need to update existing undo records).

![Undo log write in-progress](undo_buffer_4.svg)
![Undo log write in-progress](images/undo_buffer_4.svg)

After the writing is finished, the written location is advanced, and the whole undo range is available for reading and writing.

![Undo log written](undo_buffer_5.svg)
![Undo log written](images/undo_buffer_5.svg)
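
The sketch below summarizes the protocol under simplified assumptions: locations are modeled as monotonically increasing 64-bit offsets (taken modulo the buffer size for actual addressing), the names are invented, and all synchronization is elided.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t UndoLocation; /* monotonically increasing offset */

typedef struct
{
    UndoLocation retain;            /* oldest undo we must keep */
    UndoLocation written;           /* everything before lives in files */
    UndoLocation write_in_progress; /* [written, here) is being flushed */
    UndoLocation insert;            /* next record is placed here */
    uint64_t     buffer_size;       /* circular buffer capacity */
} UndoMeta;

/*
 * Records in [written, insert) must stay in the circular buffer, so a
 * new record fits only if that range plus the record does not exceed
 * capacity.  (Records below `retain` may be discarded entirely, which
 * is a separate concern from flushing.)
 */
static bool
fits_in_buffer(const UndoMeta *m, uint64_t record_size)
{
    return (m->insert - m->written) + record_size <= m->buffer_size;
}

/*
 * Flush one chunk to undo files.  While write_in_progress is ahead of
 * written, records in between may still be read from the buffer but
 * must not be updated in place.
 */
static void
flush_some_undo(UndoMeta *m, uint64_t chunk)
{
    m->write_in_progress = m->written + chunk;
    /* ... write the range [written, write_in_progress) to undo files ... */
    m->written = m->write_in_progress; /* publish: range is now durable */
}
```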
6 changes: 3 additions & 3 deletions doc/checkpoint.md
@@ -12,7 +12,7 @@ The checkpointer walks OrioleDB trees in LNR-order. The walk is divided into steps.

The picture below gives an example of the checkpointer walking an OrioleDB B-tree.

![Checkpoint walk the tree](checkpoint_walk.svg)
![Checkpoint walk the tree](images/checkpoint_walk.svg)

The checkpointer comes to the root page `n1` with the `WalkDownwards` message (step 1), then checks the first downlink `l1`. Since `l1` is an in-memory downlink, the checkpointer moves to the leaf page `n2` with the `WalkDownwards` message (step 2). After flushing `n2`, the checkpointer comes back to `n1` with the `WalkUpwards` message (step 3) and continues iterating over `n1`'s downlinks. Similarly, it walks down to `n3` and back via the `l2` downlink (steps 4 and 5). `l3` appears to be an IO in-progress downlink, so the checkpointer has to unlock the `n1` page, wait till the IO is completed, and continue with the `WalkContinue` message (step 6). After relocking `n1`, the checkpointer finds `l3` to be an on-disk downlink and copies it as is. Finally, the checkpointer walks down to `n4` and back via the `l4` downlink (steps 7 and 8). Once `n1` is done, the checkpointer finishes the walk with the `WalkUpwards` message.
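
The sketch below replays this walk in code. The message names follow the text, but the node layout and the way IO in-progress downlinks resolve are invented simplifications; locking and waiting are reduced to comments.

```c
#include <stdio.h>

typedef enum
{
    DOWNLINK_IN_MEMORY,
    DOWNLINK_ON_DISK,
    DOWNLINK_IO_IN_PROGRESS
} DownlinkKind;

typedef struct Node Node;
struct Node
{
    const char  *name;
    int          ndown;     /* 0 for a leaf */
    DownlinkKind kinds[8];
    Node        *children[8];
};

static void
walk(Node *node)
{
    printf("%s: WalkDownwards\n", node->name);

    for (int i = 0; i < node->ndown; i++)
    {
        switch (node->kinds[i])
        {
            case DOWNLINK_IN_MEMORY:
                walk(node->children[i]); /* flush the child, then return */
                break;
            case DOWNLINK_IO_IN_PROGRESS:
                /* unlock, wait for the IO to complete, relock */
                printf("%s: WalkContinue\n", node->name);
                /* after relocking, the downlink is on-disk: fall through */
            case DOWNLINK_ON_DISK:
                /* copy the on-disk downlink as is */
                break;
        }
    }
    printf("%s: WalkUpwards\n", node->name);
}

int
main(void)
{
    Node n2 = {"n2", 0, {0}, {0}};
    Node n3 = {"n3", 0, {0}, {0}};
    Node n4 = {"n4", 0, {0}, {0}};
    Node n1 = {"n1", 4,
               {DOWNLINK_IN_MEMORY, DOWNLINK_IN_MEMORY,
                DOWNLINK_IO_IN_PROGRESS, DOWNLINK_IN_MEMORY},
               {&n2, &n3, NULL, &n4}};

    walk(&n1); /* prints the step sequence from the example above */
    return 0;
}
```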

@@ -25,7 +25,7 @@ Note that the reconstructed state does not contain in-memory downlinks. In-memo

The picture below represents an example of a checkpoint state.

![Checkpoint state 1](checkpoint_state_1.svg)
![Checkpoint state 1](images/checkpoint_state_1.svg)

The root page `n1`, internal page `n3`, and leaf page `n8` are currently under checkpointing. The downlink `l2` is written to the reconstructed state. The downlink `l3` is the `next downlink` in the reconstructed state. Its key is known, but the link is not because the corresponding children were not written yet. Similarly, downlink `l6` is written, downlink `l7` is the `next downlink`, and downlink `l8` is not processed yet.

@@ -38,7 +38,7 @@ Let us imagine that the following event happened:

The resulting state is given in the picture below. Note that the reconstructed page image contains links `l6` and `l7` (as we visited them before the merge) but also contains `l8` and the `next downlink` corresponding to `l9` (as we visited those downlinks after the split).

![Checkpoint state 2](checkpoint_state_2.svg)
![Checkpoint state 2](images/checkpoint_state_2.svg)

Autonomous non-leaf pages
-------------------------
34 changes: 17 additions & 17 deletions doc/concurrency.md
@@ -13,19 +13,19 @@ In OrioleDB, in-memory downlinks can be changed to on-disk downlinks and vice ve

Step 1 depicts the initial state of part of the tree comprising parent page 1 and child page 2.

![Split step 1](split_step_1.svg)
![Split step 1](images/split_step_1.svg)

Step 2 depicts the split of page 2, creating the new page 3. At this point, page 2 has a rightlink to page 3: if a concurrent process is looking for a key located on page 3, it will check the hikey of page 2 and traverse to page 3 via the rightlink. The keys located on page 2 can be found as usual.

![Split step 2](split_step_2.svg)
![Split step 2](images/split_step_2.svg)

Step 3 depicts the insertion of the new downlink into page 1. The rightlink between pages 2 and 3 still persists. At this point, a concurrent process can locate page 3 via the downlink; if it came to page 2 before the downlink insertion, it can still use the rightlink.

![Split step 3](split_step_3.svg)
![Split step 3](images/split_step_3.svg)

Step 4 depicts the removal of the rightlink from page 2 to page 3. If a concurrent process that came to page 2 before the downlink insertion is looking for a key located on page 3, it has to notice the missing rightlink and restart from page 1.

![Split step 4](split_step_4.svg)
![Split step 4](images/split_step_4.svg)
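
The four steps can be condensed into the following sketch. All types and helpers are invented stand-ins, and every locking and memory-ordering detail is elided; only the order of the pointer updates matters here.

```c
#include <stdlib.h>

typedef struct Page Page;
struct Page
{
    Page *rightlink; /* temporary: set only between steps 2 and 4 */
    int   hikey;     /* high key; the split moves the upper half out */
    int   ntuples;
};

/* Trivial stand-ins for the real primitives; locking is elided. */
static Page *
allocate_page(void)
{
    return calloc(1, sizeof(Page));
}

static void
move_upper_half(Page *from, Page *to)
{
    to->ntuples = from->ntuples / 2;
    from->ntuples -= to->ntuples;
    to->hikey = from->hikey;       /* the new page inherits the old hikey */
    from->hikey = from->hikey / 2; /* illustrative split key */
}

static void
insert_downlink(Page *parent, Page *child)
{
    (void) parent;
    (void) child;   /* a real implementation adds the downlink tuple */
}

static void
split(Page *parent, Page *page)
{
    Page *new_page = allocate_page();

    /* Step 2: split the page; readers beyond the new hikey follow
     * the rightlink to the new page. */
    move_upper_half(page, new_page);
    page->rightlink = new_page;

    /* Step 3: the new page becomes reachable from the parent. */
    insert_downlink(parent, new_page);

    /* Step 4: drop the rightlink; late readers now retry from page 1. */
    page->rightlink = NULL;
}
```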

Page eviction
-------------
@@ -34,19 +34,19 @@ Page eviction is forbidden for the page with a rightlink or without a downlink f

Step 1 depicts the initial state of part of the tree comprising parent page 1 and child page 2. Page 2 is locked and should be evicted. At this point, the evicting process has to find and lock the parent page 1, using page 2's hikey to locate it.

![Evict step 1](evict_step_1.svg)
![Evict step 1](images/evict_step_1.svg)

Step 2 depicts both page 1 and page 2 locked. At this point, the evicting process replaces the in-memory downlink with an IO downlink and increases the page change count. The IO downlink prevents concurrent processes from using it, making them wait till the IO is completed. The increased change count prevents all the concurrent processes that managed to use the in-memory downlink from using page 2.

![Evict step 2](evict_step_2.svg)
![Evict step 2](images/evict_step_2.svg)

Step 3 depicts page 1 disconnected from its child by the IO downlink.

![Evict step 3](evict_step_3.svg)
![Evict step 3](images/evict_step_3.svg)

Finally, the evicting process writes the on-disk downlink to page 1 (step 4).

![Evict step 4](evict_step_4.svg)
![Evict step 4](images/evict_step_4.svg)
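
A compressed sketch of steps 2-4 follows, with invented types: the in-memory downlink becomes an IO downlink, the change count invalidates processes that already read the old downlink, and the on-disk downlink is published once the write completes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Page Page;

typedef struct
{
    bool     io_in_progress; /* newcomers must wait for the IO to finish */
    uint64_t on_disk_block;  /* valid once io_in_progress is cleared */
    Page    *in_memory;      /* set only for an in-memory downlink */
} Downlink;

struct Page
{
    uint32_t change_count;   /* bumped on eviction; stale users recheck */
    Downlink links[16];
};

/* Step 2: both pages locked; swap the downlink and bump the change count. */
static uint32_t
start_eviction(Page *parent, int slot)
{
    Page *victim = parent->links[slot].in_memory;

    parent->links[slot].io_in_progress = true; /* IO downlink */
    parent->links[slot].in_memory = NULL;      /* step 3: child disconnected */

    return ++victim->change_count; /* invalidates already-read downlinks */
}

/* Step 4: the page image is on disk; publish the on-disk downlink. */
static void
finish_eviction(Page *parent, int slot, uint64_t block)
{
    parent->links[slot].on_disk_block = block;
    parent->links[slot].io_in_progress = false;
}
```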

Page load
---------
@@ -55,19 +55,19 @@ When Postgres backend needs a page referenced by the on-disk downlink on OrioleD

Step 1 depicts the initial state of the non-leaf page 1 before loading its child. Page 1 is locked and has an on-disk downlink. At this point, the process needs to replace the downlink with an IO in-progress one, unlock page 1, and start the IO.

![Load step 1](load_step_1.svg)
![Load step 1](images/load_step_1.svg)

Step 2 depicts the page 1 state while the IO is in progress. All the concurrent processes dealing with that downlink must wait until the IO is completed. Once it is, our process needs to relock page 1.

![Load step 2](load_step_2.svg)
![Load step 2](images/load_step_2.svg)

Step 3 depicts the state when page 1 is relocked. At this point, our process needs to add child page 2 and make the downlink on page 1 point to page 2.

![Load step 3](load_step_3.svg)
![Load step 3](images/load_step_3.svg)

Step 4 represents the final state. Page 2 is loaded, and page 1 contains the in-memory downlink to page 2.

![Load step 4](load_step_4.svg)
![Load step 4](images/load_step_4.svg)


Page merge
@@ -77,20 +77,20 @@ When OrioleDB has a page candidate for eviction, and that page is too sparse (wi

Step 1 depicts the initial state of part of the tree comprising parent page 1, the locked child page 3 that is to be merged, left sibling 2, and right sibling 4. At this point, we need to lock the parent page 1, so we release the lock on child page 3 first.

![Merge step 1](merge_step_1.svg)
![Merge step 1](images/merge_step_1.svg)

Step 2 depicts the state where parent page 1 is locked, but the child is not. At this point, we need to relock the child page 3 to be merged. If this page is gone (due to concurrent eviction or another merge), we give up on merging. We also check that the parent is not under checkpoint and give up otherwise.

![Merge step 2](merge_step_2.svg)
![Merge step 2](images/merge_step_2.svg)

Step 3 depicts both parent page 1 and child page 3 locked. Here we need to select how to merge. We check that child page 3 is not under checkpoint and does not have a rightlink, and give up otherwise.

![Merge step 3](merge_step_3.svg)
![Merge step 3](images/merge_step_3.svg)

Step 4 depicts two possible ways to merge: with the right sibling 4 (upper picture) or with the left sibling 2 (lower picture). Note that the left page always keeps its identity, whichever direction we choose; so if we merge to the left, page 3 will be removed. We cannot merge with a sibling that is under checkpoint or has a rightlink.

![Merge step 4](merge_step_4.svg)
![Merge step 4](images/merge_step_4.svg)

Step 5 depicts the merge result. In this example, we merged to the right: all the tuples on page 4 were merged into page 3. The resulting page is marked as page 34 and has the hikey of page 4. We also remove page 4 and the corresponding downlink on page 1.

![Merge step 5](merge_step_5.svg)
![Merge step 5](images/merge_step_5.svg)