[core] Setup cgroup v2 in C++ #49416

Open
wants to merge 18 commits into
base: master
17 changes: 17 additions & 0 deletions src/ray/common/cgroup/BUILD
@@ -9,3 +9,20 @@ ray_cc_library(
"@com_google_absl//absl/strings:str_format",
],
)

ray_cc_library(
name = "cgroup_context",
hdrs = ["cgroup_context.h"],
)

ray_cc_library(
name = "cgroup_utils",
srcs = ["cgroup_utils.cc"],
hdrs = ["cgroup_utils.h"],
deps = [
":cgroup_context",
"//src/ray/util",
"@com_google_absl//absl/strings:str_format",
"@com_google_absl//absl/strings",
],
)
42 changes: 42 additions & 0 deletions src/ray/common/cgroup/cgroup_context.h
@@ -0,0 +1,42 @@
// Copyright 2024 The Ray Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#pragma once

#include <unistd.h>

#include <cstdint>
#include <string>

namespace ray {

// Context used to set up cgroup v2 for a task / actor.
struct PhysicalModeExecutionContext {
Contributor

Q: do we need this separate config class from CgroupV2Setup?

Contributor Author

These data fields are necessary to construct and destruct the cgroup.
As of now the struct doesn't seem strictly necessary, since it only contains 4 fields and we could pass them directly into the factory function. But we could have many more fields later (e.g. CPU-related, resource min / high), so it's better to have a struct.

// Directory for cgroup, which is applied to application process.
//
// TODO(hjiang): Revisit if we could save some CPU/mem with string view.
std::string cgroup_directory;
// A unique id that identifies a certain task / actor attempt.
std::string id;
// PID for the process.
pid_t pid;

// Memory-related spec.
//
// Unit: bytes. Corresponds to cgroup v2 `memory.max`, which enforces a hard cap on
// memory consumption. 0 means no limit.
uint64_t max_memory = 0;
};

} // namespace ray
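
The `max_memory` comment treats 0 as "no limit". In cgroup v2, the `memory.max` interface file expresses "no limit" with the literal string `max`, so a translation step along the following lines could sit between the struct and the file write. This is only a sketch: `FormatMemoryMax` is a hypothetical name and is not part of this PR.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of this PR): converts the `max_memory` field
// into the text that would be written to the cgroup v2 `memory.max` file.
// The kernel interface uses the literal "max" to mean "no limit".
std::string FormatMemoryMax(uint64_t max_memory_bytes) {
  if (max_memory_bytes == 0) {
    return "max";
  }
  return std::to_string(max_memory_bytes);
}
```

With this helper, `FormatMemoryMax(0)` yields `max` and any positive value yields its decimal byte count.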
161 changes: 161 additions & 0 deletions src/ray/common/cgroup/cgroup_utils.cc
@@ -0,0 +1,161 @@
// Copyright 2024 The Ray Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include "ray/common/cgroup/cgroup_utils.h"

#ifndef __linux__
namespace ray {
bool CgroupV2Setup::SetupCgroupV2ForContext(const PhysicalModeExecutionContext &ctx) {
return false;
}
/*static*/ bool CgroupV2Setup::CleanupCgroupV2ForContext(
const PhysicalModeExecutionContext &ctx) {
return false;
}
} // namespace ray
#else // __linux__

#include <sys/stat.h>

#include <fstream>

#include "absl/strings/str_format.h"
#include "absl/strings/str_join.h"
#include "absl/strings/str_split.h"
#include "ray/util/logging.h"

namespace ray {

namespace {

// Owner can read and write.
constexpr int kCgroupV2FilePerm = 0600;

// There are two types of memory cgroup constraints:
// 1. Processes with a memory limit get a dedicated cgroup;
// 2. Processes without a limit are added to the default cgroup.
static constexpr std::string_view kDefaultCgroupV2Id = "default_cgroup_id";

Contributor

It's not a UUID, and we should not expose a name like this in Linux. We had the idea of deriving a default name from the cluster ID, right?

Contributor Author

Renamed as id.

Contributor Author

We had the idea of getting a default name from cluster ID right?

I'm not sure how the cluster ID is related here.

// Open the cgroup file at [path] and append [content] to it.
void OpenCgroupV2FileAndAppend(std::string_view path, std::string_view content) {
// A string_view is not guaranteed to be null-terminated, so copy it first.
std::ofstream out_file{std::string(path), std::ios::out | std::ios::app};
out_file << content;
}

bool CreateNewCgroupV2(const PhysicalModeExecutionContext &ctx) {
// Sanity check.
RAY_CHECK(!ctx.id.empty());
Contributor

[idea] Can we put these constraints into the ctor of PhysicalModeExecutionContext?

Contributor Author

I considered that:

  • I view the context struct as a wrapper around function arguments, just to avoid handling too many possible combinations of cgroup configs;
  • If I understand correctly, you're asking me to construct the context via a constructor and check there? I didn't go down that path (as I mentioned above) because we could have multiple combinations:
    • I don't want a very large constructor that takes all cgroup params (e.g. Context(mem_min, cpu_min, mem_max, cpu_max, ...));
    • I'd also rather not have different sets of cgroup params (e.g. Context(mem_min, mem_max), Context(cpu_min, cpu_max), ...)
  • Maybe I could add a check util function inside or outside of the struct?
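
As one possible shape for the "check util function" idea, the sketch below gathers the constraints that the `RAY_CHECK` calls in `cgroup_utils.cc` express into a single free function. Everything here is an assumption for illustration: `ExecutionContext` is a local stand-in for `PhysicalModeExecutionContext`, and `ValidateContext` and `kDefaultId` are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <sys/types.h>

// Local stand-in mirroring the fields of PhysicalModeExecutionContext.
struct ExecutionContext {
  std::string cgroup_directory;
  std::string id;
  pid_t pid = 0;
  uint64_t max_memory = 0;
};

// Mirrors the value of kDefaultCgroupV2Id in cgroup_utils.cc.
const char kDefaultId[] = "default_cgroup_id";

// Hypothetical helper: returns an empty string if `ctx` satisfies the same
// constraints as the RAY_CHECKs in cgroup_utils.cc, otherwise a description
// of the first violation found.
std::string ValidateContext(const ExecutionContext &ctx) {
  if (ctx.id.empty()) {
    return "id is empty";
  }
  const bool is_default = (ctx.id == kDefaultId);
  if (ctx.max_memory > 0 && is_default) {
    return "a memory-capped context must not use the default cgroup id";
  }
  if (ctx.max_memory == 0 && !is_default) {
    return "an uncapped context must use the default cgroup id";
  }
  return "";
}
```

A free function like this keeps the struct a plain aggregate while still giving callers one place to validate it before calling the factory.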

RAY_CHECK_NE(ctx.id, kDefaultCgroupV2Id);
RAY_CHECK_GT(ctx.max_memory, 0);

const std::string cgroup_folder =
absl::StrFormat("%s/%s", ctx.cgroup_directory, ctx.id);
int ret_code = mkdir(cgroup_folder.data(), kCgroupV2FilePerm);
if (ret_code != 0) {
return false;
}

const std::string procs_path = absl::StrFormat("%s/cgroup.procs", cgroup_folder);
OpenCgroupV2FileAndAppend(procs_path, absl::StrFormat("%d", ctx.pid));

// Add max memory into cgroup.
const std::string max_memory_path = absl::StrFormat("%s/memory.max", cgroup_folder);
OpenCgroupV2FileAndAppend(max_memory_path, absl::StrFormat("%d", ctx.max_memory));

return true;
}

bool UpdateDefaultCgroupV2(const PhysicalModeExecutionContext &ctx) {
// Sanity check.
RAY_CHECK(!ctx.id.empty());
RAY_CHECK_EQ(ctx.id, kDefaultCgroupV2Id);
RAY_CHECK_EQ(ctx.max_memory, 0);

const std::string cgroup_folder =
absl::StrFormat("%s/%s", ctx.cgroup_directory, ctx.id);
// NOTE: mkdir returns non-zero (errno EEXIST) if the default cgroup folder already exists.
int ret_code = mkdir(cgroup_folder.data(), kCgroupV2FilePerm);
if (ret_code != 0) {
return false;
}

const std::string procs_path = absl::StrFormat("%s/cgroup.procs", cgroup_folder);
OpenCgroupV2FileAndAppend(procs_path, absl::StrFormat("%d", ctx.pid));

return true;
}

bool DeleteCgroupV2(const PhysicalModeExecutionContext &ctx) {
// Sanity check.
RAY_CHECK(!ctx.id.empty());
RAY_CHECK_NE(ctx.id, kDefaultCgroupV2Id);
RAY_CHECK_GT(ctx.max_memory, 0);

const std::string cgroup_folder =
absl::StrFormat("%s/%s", ctx.cgroup_directory, ctx.id);
return rmdir(cgroup_folder.data()) == 0;
}

void PlaceProcessIntoDefaultCgroup(const PhysicalModeExecutionContext &ctx) {
const std::string procs_path =
absl::StrFormat("%s/%s/cgroup.procs", ctx.cgroup_directory, kDefaultCgroupV2Id);
{
std::ofstream out_file{procs_path.data(), std::ios::out};
out_file << ctx.pid;
}
}

} // namespace

/*static*/ std::unique_ptr<CgroupV2Setup> CgroupV2Setup::New(
PhysicalModeExecutionContext ctx) {
if (!CgroupV2Setup::SetupCgroupV2ForContext(ctx)) {
Contributor

Q: Do we want to CHECK-fail in the ctor on "cgroup already exists"? Otherwise, if two objects manage the same cgroup, the first dtor deletes it and affects the other. We don't expect anyone else to create cgroups with our naming scheme, so a CHECK failure should be acceptable.

Contributor Author

@dentiny dentiny Jan 6, 2025

If the folder already exists, the cgroup setup function returns false, and nullptr is returned here.

My idea is to fall back to the "don't use cgroup" behavior; does that sound OK to you?
Or do you explicitly want to special-case the EEXIST error code?

Contributor Author

In the latest commit I special-case EEXIST and treat it as an internal error.

return nullptr;
}
return std::unique_ptr<CgroupV2Setup>(new CgroupV2Setup(std::move(ctx)));
}

CgroupV2Setup::~CgroupV2Setup() {
Contributor

Before deleting the cgroup v2 directory we first need to move the process out of it, or the deletion will fail. For that you need to record the previous cgroup, if any, or we can blindly move all those processes to the default cgroup. This could come from a "global cgroup manager" in raylet:

class CgroupV2Manager {
public:
explicit CgroupV2Manager(std::string default_cgroup_name);

bool PutPidIntoDefaultCgroupRemovingAnyCgroupsIfAny(pid_t pid);
bool CreateCgroupV2ForPid(pid_t pid, std::string cgroup_name);

};

Contributor Author

before deleting the cgroup v2 we need to first move proc out of it, or the deletion would fail.

Yes, fixed.

Contributor Author

Discussed offline. I added a few items based on our offline discussion:

  • I added a README for cgroup
  • I added comments in the local task manager on how I plan to integrate the cgroup RAII class with it
  • I added comments on how I plan to prepare the basic cgroup setup in raylet

if (!CleanupCgroupV2ForContext(ctx_)) {
RAY_LOG(ERROR) << "Failed to clean up cgroup for execution context with id " << ctx_.id;
}
}

/*static*/ bool CgroupV2Setup::SetupCgroupV2ForContext(
const PhysicalModeExecutionContext &ctx) {
// Create a new cgroup if max memory specified.
if (ctx.max_memory > 0) {
return CreateNewCgroupV2(ctx);
}

// Update default cgroup if no max resource specified.
return UpdateDefaultCgroupV2(ctx);
}

/*static*/ bool CgroupV2Setup::CleanupCgroupV2ForContext(
const PhysicalModeExecutionContext &ctx) {
// Delete the dedicated cgroup if max memory specified.
if (ctx.max_memory > 0) {
PlaceProcessIntoDefaultCgroup(ctx);
return DeleteCgroupV2(ctx);
}

// If pid already in default cgroup, no action needed.
return true;
}

} // namespace ray

#endif // __linux__
61 changes: 61 additions & 0 deletions src/ray/common/cgroup/cgroup_utils.h
@@ -0,0 +1,61 @@
// Copyright 2024 The Ray Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// Util functions to set up cgroup.

#pragma once

#include <memory>
#include <string_view>
#include <utility>

#include "ray/common/cgroup/cgroup_context.h"

namespace ray {

// A util class which sets up a cgroup at construction and cleans it up at destruction.
// On ctor, creates a cgroup v2 if necessary based on the context, then puts `ctx.pid`
// into this cgroup.
// On dtor, puts `ctx.pid` into the default cgroup and removes this cgroup v2, if any.
//
// Precondition:
// 1. rw permission for cgroup has been validated.
// 2. Cgroup folder (i.e. default application cgroup folder) has been properly setup.
// See README under this folder for more details.
class CgroupV2Setup {
public:
// A failed construction returns nullptr.
static std::unique_ptr<CgroupV2Setup> New(PhysicalModeExecutionContext ctx);

~CgroupV2Setup();

CgroupV2Setup(const CgroupV2Setup &) = delete;
CgroupV2Setup &operator=(const CgroupV2Setup &) = delete;
CgroupV2Setup(CgroupV2Setup &&) = delete;
CgroupV2Setup &operator=(CgroupV2Setup &&) = delete;

private:
CgroupV2Setup(PhysicalModeExecutionContext ctx) : ctx_(std::move(ctx)) {}

// Set up cgroup based on the given [ctx]. Returns whether the setup succeeds.
static bool SetupCgroupV2ForContext(const PhysicalModeExecutionContext &ctx);

// Clean up cgroup based on the given [ctx]. Returns whether the cleanup succeeds.
static bool CleanupCgroupV2ForContext(const PhysicalModeExecutionContext &ctx);

// Execution context for current cgroup v2 setup.
PhysicalModeExecutionContext ctx_;
};

} // namespace ray
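
The factory-plus-destructor pattern that `CgroupV2Setup` follows can be seen in isolation in the simplified sketch below. It only manages a directory's lifetime and does not move any process between cgroups; `ScopedCgroupDir` is a hypothetical stand-in, not the class from this PR, and the `0700` mode is an assumption.

```cpp
#include <memory>
#include <string>
#include <sys/stat.h>
#include <unistd.h>
#include <utility>

// Simplified stand-in for CgroupV2Setup: creates a directory in the factory
// and removes it in the destructor.
class ScopedCgroupDir {
 public:
  // Returns nullptr when the directory cannot be created, mirroring the
  // "failed construction returns nullptr" contract of CgroupV2Setup::New.
  static std::unique_ptr<ScopedCgroupDir> New(std::string dir) {
    if (mkdir(dir.c_str(), 0700) != 0) {
      return nullptr;
    }
    return std::unique_ptr<ScopedCgroupDir>(new ScopedCgroupDir(std::move(dir)));
  }

  ~ScopedCgroupDir() { rmdir(dir_.c_str()); }

  ScopedCgroupDir(const ScopedCgroupDir &) = delete;
  ScopedCgroupDir &operator=(const ScopedCgroupDir &) = delete;

  const std::string &dir() const { return dir_; }

 private:
  explicit ScopedCgroupDir(std::string dir) : dir_(std::move(dir)) {}

  std::string dir_;
};
```

Because the factory refuses to wrap a directory it did not create, two live objects never own the same directory, which is the hazard raised in the review thread about duplicate cgroups.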
12 changes: 12 additions & 0 deletions src/ray/raylet/local_task_manager.cc
@@ -387,6 +387,16 @@ void LocalTaskManager::DispatchScheduledTasksToWorkers() {
const std::shared_ptr<WorkerInterface> worker,
PopWorkerStatus status,
const std::string &runtime_env_setup_error_message) -> bool {
// TODO(hjiang): After getting the ready-to-use worker and task id, we're
// able to get physical execution context.
//
// ownership chain: raylet has-a node manager, node manager has-a local task
// manager.
//
// - PID: could get from available worker
// - Attempt id: could pass a global attempt id generator from raylet
// - Cgroup application folder: could pass from raylet

return PoppedWorkerHandler(worker,
status,
task_id,
@@ -729,6 +739,8 @@ void LocalTaskManager::RemoveFromRunningTasksIfExists(const RayTask &task) {
auto sched_cls = task.GetTaskSpecification().GetSchedulingClass();
auto it = info_by_sched_cls_.find(sched_cls);
if (it != info_by_sched_cls_.end()) {
// TODO(hjiang): After removing the task id from `running_tasks`, the corresponding
// cgroup will be updated.
it->second.running_tasks.erase(task.GetTaskSpecification().TaskId());
if (it->second.running_tasks.size() == 0) {
info_by_sched_cls_.erase(it);
2 changes: 2 additions & 0 deletions src/ray/raylet/local_task_manager.h
@@ -297,6 +297,8 @@ class LocalTaskManager : public ILocalTaskManager {
capacity(cap),
next_update_time(std::numeric_limits<int64_t>::max()) {}
/// Track the running task ids in this scheduling class.
///
/// TODO(hjiang): Store cgroup manager along with task id as the value for map.
absl::flat_hash_set<TaskID> running_tasks;
/// The total number of tasks that can run from this scheduling class.
const uint64_t capacity;
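
One way to read the TODO about storing a cgroup manager next to each task id is a map whose values own the RAII object, so that erasing a finished task also triggers cgroup cleanup. The sketch below uses standard containers and string ids as stand-ins for `absl::flat_hash_set<TaskID>`; `FakeCgroupGuard`, `RunningTasks`, and `AddRunningTask` are hypothetical names and not part of this PR.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for the cgroup RAII object: records its id on destruction so the
// moment of cleanup can be observed.
class FakeCgroupGuard {
 public:
  FakeCgroupGuard(std::string id, std::vector<std::string> *cleaned)
      : id_(std::move(id)), cleaned_(cleaned) {}
  ~FakeCgroupGuard() { cleaned_->push_back(id_); }

  FakeCgroupGuard(const FakeCgroupGuard &) = delete;
  FakeCgroupGuard &operator=(const FakeCgroupGuard &) = delete;

 private:
  std::string id_;
  std::vector<std::string> *cleaned_;
};

// Task id (a plain string here) mapped to the guard that owns its cgroup.
using RunningTasks =
    std::unordered_map<std::string, std::unique_ptr<FakeCgroupGuard>>;

// Registers a task together with its guard.
void AddRunningTask(RunningTasks &tasks, const std::string &task_id,
                    std::vector<std::string> *cleaned) {
  tasks.emplace(task_id, std::unique_ptr<FakeCgroupGuard>(
                             new FakeCgroupGuard(task_id, cleaned)));
}
```

With this layout, a call such as `tasks.erase(task_id)` in the removal path destroys the guard and therefore runs the cleanup without a separate bookkeeping step.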
2 changes: 2 additions & 0 deletions src/ray/raylet/main.cc
@@ -200,6 +200,8 @@ int main(int argc, char *argv[]) {
RAY_LOG(INFO) << "Setting cluster ID to: " << cluster_id;
gflags::ShutDownCommandLineFlags();

// TODO(hjiang): Before we do any actual work, set up cgroup.

// Configuration for the node manager.
ray::raylet::NodeManagerConfig node_manager_config;
absl::flat_hash_map<std::string, double> static_resource_conf;