Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GraphBolt] Refactor S3-FIFO and add SIEVE, LRU and CLOCK. #7508

Merged
merged 12 commits into from
Jul 9, 2024

Conversation

mfbalin
Copy link
Collaborator

@mfbalin mfbalin commented Jul 6, 2024

Description

Refactored the code and added the SIEVE, LRU and CLOCK cache policy algorithm.

CLOCK: https://people.csail.mit.edu/saltzer/Multics/MHP-Saltzer-060508/bookcases/M00s/M0104%20074-12).PDF

The S3-FIFO implementation was behaving a bit differently from the official implementation (upto 2% higher miss rate). So I wrote a short script to compare against the official implementation on several datasets as can be seen below. Moreover, I got rid of garbage collection if the ghost map gets too big and instead used a queue for it so that we never have to do extra work sometimes. This is because when using parallelism, we will wait for the slowest partition and when we do overlapping in the dataloader, these delays will cause a bubble in the pipeline.

Now, the behaviors consistently match across different datasets for S3-FIFO:

My miss rates (s3-fifo: 0.115895, lru: 0.165391, sieve: 0.168564, clock: 0.168460), miss rate original: (s3-fifo: 0.119878, lru: 0.165391), fiu/fiu_casa-110108-112108.oracleGeneral.zst: : 8973201it [20:06, 7437.01it/s]
My miss rates (s3-fifo: 0.316334, lru: 0.350072, sieve: 0.409565, clock: 0.349661), miss rate original: (s3-fifo: 0.316330, lru: 0.350072), fiu/fiu_homes-110108-112108.oracleGeneral.zst: : 17836701it [44:14, 6718.70it/s]
My miss rates (s3-fifo: 0.372138, lru: 0.414476, sieve: 0.384086, clock: 0.414683), miss rate original: (s3-fifo: 0.372138, lru: 0.414476), alibabaBlock/io_traces.ns0.oracleGeneral.zst: : 13800000it [35:29, 6481.64it/s]
My miss rates (s3-fifo: 0.220733, lru: 0.223484, sieve: 0.232164, clock: 0.224612), miss rate original: (s3-fifo: 0.220650, lru: 0.223484), tencentBlock/tencentBlock.ns1182.oracleGeneral.zst: : 16024596it [38:18, 6972.53it/s]
My miss rates (s3-fifo: 0.841126, lru: 0.870302, sieve: 0.850204, clock: 0.868771), miss rate original: (s3-fifo: 0.841126, lru: 0.870302), msr/msr_proj_2.oracleGeneral.zst: : 29266482it [1:30:12, 5407.19it/s]
import requests
import zstandard as zstd
from struct import iter_unpack
import torch
from dgl.graphbolt.impl import FeatureCache
from cachemonCache import S3FIFO, LRU
from tqdm import tqdm
import os

dtype = torch.int64
num_partitions = 1

base_URL = "https://ftp.pdl.cmu.edu/pub/datasets/twemcacheWorkload/cacheDatasets/"
datasets = [
    (50000, "fiu/fiu_casa-110108-112108.oracleGeneral.zst"),
    (100000, "fiu/fiu_homes-110108-112108.oracleGeneral.zst"),
    (100000, "alibabaBlock/io_traces.ns0.oracleGeneral.zst"),
    (100000, "tencentBlock/tencentBlock.ns1182.oracleGeneral.zst"),
    (1000000, "msr/msr_proj_2.oracleGeneral.zst"),
]
for capacity, dataset in datasets:
    data = requests.get(os.path.join(base_URL, dataset), stream=False).content
    pbar = tqdm(iter_unpack("=IQIq", zstd.ZstdDecompressor().decompress(data)))
    my_caches = {policy: FeatureCache([capacity, 1], dtype, num_partitions, policy) for policy in ["s3-fifo", "lru", "sieve", "clock"]}
    other_caches = {name: cache(capacity) for name, cache in zip(["s3-fifo", "lru"], [S3FIFO, LRU])}
    for i, (timestamp, obj_id, obj_size, next_access_vtime) in enumerate(pbar):
        for cache in my_caches.values():
            values, missing_index, missing_keys = cache.query(torch.tensor([obj_id]))
            if missing_keys.numel() > 0:
                cache.replace(missing_keys, torch.tensor([obj_size]))
        for cache in other_caches.values():
            val = cache.get(obj_id)
            if val is None:
                cache.put(obj_id, obj_size)
        if i % 1000 == 999:
            pbar.set_description(f"My miss rates ({", ".join([f"{name}: {cache.miss_rate:.6f}" for name, cache in my_caches.items()])}), miss rate original: ({", ".join([f"{name}: {1.0 - cache.n_hit / (i + 1):.6f}" for name, cache in other_caches.items()])}), {dataset}")

@mfbalin mfbalin requested a review from frozenbugs July 6, 2024 06:08
@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 6, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 6, 2024

Commit ID: 6df5c68

Build ID: 1

Status: ❌ CI test failed in Stage [Torch CPU Unit test].

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 6, 2024

Commit ID: f30814b

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin changed the title [GraphBolt] Refine S3-FIFO cache policy. [GraphBolt] Refactor S3-FIFO and add SIEVE. Jul 8, 2024
@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 8, 2024

Commit ID: 955b55a

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin changed the title [GraphBolt] Refactor S3-FIFO and add SIEVE. [GraphBolt] Refactor S3-FIFO and add SIEVE and LRU. Jul 8, 2024
@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 8, 2024

Commit ID: 18ff830

Build ID: 4

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 8, 2024

Commit ID: f0f27ba

Build ID: 5

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 8, 2024

Commit ID: fb8eaf8

Build ID: 6

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 8, 2024

Commit ID: 5247dab

Build ID: 7

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 8, 2024

Commit ID: 237268e

Build ID: 8

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 8, 2024

Commit ID: f70a3a2

Build ID: 9

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin changed the title [GraphBolt] Refactor S3-FIFO and add SIEVE and LRU. [GraphBolt] Refactor S3-FIFO and add SIEVE, LRU and CLOCK. Jul 8, 2024
@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 8, 2024

Commit ID: 1f1332c

Build ID: 10

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 9, 2024

Commit ID: c312794

Build ID: 11

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

graphbolt/src/cache_policy.h Show resolved Hide resolved
@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 9, 2024

Commit ID: d73e338

Build ID: 12

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin merged commit de33ee1 into dmlc:master Jul 9, 2024
2 checks passed
@mfbalin mfbalin deleted the gb_refine_s3_fifo_cache_policy branch July 9, 2024 05:37
lijialin03 pushed a commit to lijialin03/dgl that referenced this pull request Jan 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants