[Test] refactor Github Actions Used for FedML-AI/FedML CI #2180

Open

wants to merge 69 commits into base: alexleung/dev_v070_for_refactor

Changes from 1 commit

Commits (69)
921c199
Update smoke_test_cross_silo_fedavg_attack_linux.yml
xiang-wang-innovator Jun 6, 2024
5bca440
Update smoke_test_cross_silo_fedavg_attack_linux.yml
xiang-wang-innovator Jun 7, 2024
23c955e
Update smoke_test_cross_silo_fedavg_attack_linux.yml
xiang-wang-innovator Jun 7, 2024
15d341e
Update sync-fedml-pip.sh
xiang-wang-innovator Jun 7, 2024
9cb2a59
Update smoke_test_cross_silo_fedavg_attack_linux.yml
xiang-wang-innovator Jun 7, 2024
74b8f59
Update smoke_test_ml_engines_linux_tf.yml
xiang-wang-innovator Jun 7, 2024
f9f36f6
Update smoke_test_cross_silo_fedavg_attack_linux.yml
xiang-wang-innovator Jun 7, 2024
064ec96
Update smoke_test_cross_silo_fedavg_attack_linux.yml
xiang-wang-innovator Jun 7, 2024
c4a8714
[Deploy] Report worker's connectivity when it finished.
Raphael-Jin Jun 11, 2024
f644812
Update smoke_test_cross_silo_fedavg_attack_linux.yml
xiang-wang-innovator Jun 11, 2024
c37573c
Update smoke_test_pip_cli_sp_linux.yml
xiang-wang-innovator Jun 11, 2024
753f95c
Update smoke_test_pip_cli_sp_linux.yml
xiang-wang-innovator Jun 11, 2024
8bdda1c
Update smoke_test_pip_cli_sp_linux.yml
xiang-wang-innovator Jun 11, 2024
2b15e30
Update build.sh
xiang-wang-innovator Jun 11, 2024
c315966
Update smoke_test_pip_cli_sp_linux.yml
xiang-wang-innovator Jun 11, 2024
a2c9410
Update smoke_test_pip_cli_sp_linux.yml
xiang-wang-innovator Jun 11, 2024
4105806
Update smoke_test_pip_cli_sp_linux.yml
xiang-wang-innovator Jun 11, 2024
83d48d2
Update smoke_test_pip_cli_sp_linux.yml
xiang-wang-innovator Jun 11, 2024
876d26c
Merge branch 'dev/v0.7.0' into wx_develop_action
Jun 11, 2024
207b5fb
Merge branch 'raphael/unify-connectivity' of https://github.com/FedML…
fedml-dimitris Jun 11, 2024
4a9622c
Adding default http connectivity type constant. Fixing minor typos an…
fedml-dimitris Jun 11, 2024
34fdba0
Merge pull request #2157 from FedML-AI/raphael/unify-connectivity
Raphael-Jin Jun 11, 2024
23d88fc
[Deploy] Remove unnecessary logic.
Raphael-Jin Jun 11, 2024
e0ad9b5
[Deploy] Remove unnecessary logic; Rename readiness check function; F…
Raphael-Jin Jun 11, 2024
64e8c77
[Deploy] Nit
Raphael-Jin Jun 11, 2024
9194f84
[Deploy] Hide unnecessary log.
Raphael-Jin Jun 11, 2024
8530973
Merge pull request #2165 from FedML-AI/raphael/refactor-container-dep…
fedml-dimitris Jun 11, 2024
008266f
add some news
Jun 12, 2024
e25ad75
modify smoke test pip cli sp linux
Jun 12, 2024
62093a5
change path address
Jun 12, 2024
295ca57
cancel fedml login/ fedml build
Jun 12, 2024
7554a74
update smoke_test_security
Jun 12, 2024
8900842
update smoke test simulation mpi linux
Jun 12, 2024
8d55bc8
add
Jun 12, 2024
745ef6e
update mpi linux
Jun 12, 2024
bde643e
update mpi linux
Jun 12, 2024
3fbaaee
Merge branch 'dev/v0.7.0' into wx_develop_action
Jun 12, 2024
c20dd77
change git fetch
Jun 12, 2024
bae59fb
update path
Jun 12, 2024
c4ec02d
modify
Jun 12, 2024
257c0a7
stash
Jun 12, 2024
e7f7bb9
modify
Jun 12, 2024
c89239a
add necessary things
Jun 12, 2024
590412c
modfiy
Jun 13, 2024
2dbbf33
add install fedml
Jun 13, 2024
28cb1fe
modify
Jun 13, 2024
742862f
change actions build
Jun 13, 2024
11ab658
modify github-action-docker
Jun 17, 2024
5fb11e8
moidfy
Jun 17, 2024
ff769f4
modify
Jun 17, 2024
23f15b2
Create python-package-conda.yml
xiang-wang-innovator Jun 17, 2024
a9967b2
modify workflow
Jun 17, 2024
f3fa51b
Merge pull request #1 from Qigemingziba/wx_develop_action
xiang-wang-innovator Jun 17, 2024
719cfe4
modify workflow
Jun 17, 2024
573d2f7
update the CI_build.yml
Jun 17, 2024
24196ec
modify workflow
Jun 17, 2024
b796dc8
test
Jun 17, 2024
41ea04a
completed job
Jun 17, 2024
6e6b2a2
add some file
Jun 17, 2024
12dae4d
modify
Jun 17, 2024
b3fc51e
modify bug
Jun 17, 2024
96b6dbf
test
Jun 17, 2024
846a6c9
ttt
Jun 17, 2024
6d33c2f
modify
Jun 17, 2024
95a9844
modify
Jun 17, 2024
07f6616
modify
Jun 17, 2024
4bbce76
Merge pull request #2 from Qigemingziba/test_pr
xiang-wang-innovator Jun 17, 2024
ea9320b
[Test] refactor Github Actions Used for FedML-AI/FedML CI
Jun 18, 2024
1275034
merge master
Jun 18, 2024
[Deploy] Remove unnecessary logic; Rename readiness check function; Forbid user-level control of the host port.
Raphael-Jin committed Jun 11, 2024
commit e0ad9b5bef5bcea1eaefe3458a3d6b49aa399d46
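
For reviewers, the core behavioral change in this commit: the host port is no longer taken from `worker_port` in fedml_model_config.yaml or the `FEDML_WORKER_PORT` environment variable; instead the container port is bound to `None`, letting Docker allocate a free host port that the code then reads back. A minimal docker-py sketch of that pattern (image and variable names here are illustrative placeholders, not FedML code):

```python
# Sketch only: demonstrates "bind to None, then ask Docker for the assigned host port".
import docker

api = docker.APIClient()          # low-level client, same role as client.api in the diff
CONTAINER_PORT = 2345             # mirrors config.get("port", 2345) in the new code

# port_bindings={CONTAINER_PORT: None} tells Docker to pick a free ephemeral host port.
host_config = api.create_host_config(port_bindings={CONTAINER_PORT: None})
container = api.create_container(
    image="nginx:alpine",         # placeholder image for illustration
    ports=[CONTAINER_PORT],       # port opened inside the container
    host_config=host_config,
)
api.start(container=container.get("Id"))

# Equivalent of the retry loop in the diff: read back the randomly assigned host port.
port_info = api.port(container.get("Id"), CONTAINER_PORT)
inference_http_port = port_info[0]["HostPort"]
print(f"host port allocated: {inference_http_port}")
```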
@@ -68,6 +68,7 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,
num_gpus = gpu_per_replica
gpu_ids, gpu_attach_cmd = None, ""

# Concatenate the model name
running_model_name = ClientConstants.get_running_model_name(
end_point_name, inference_model_name, model_version, end_point_id, model_id, edge_id=edge_id)

@@ -77,6 +78,7 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,
config = yaml.safe_load(file)

# Resource related
inference_type = "default"
use_gpu = config.get('use_gpu', True)
num_gpus_frm_yml = config.get('num_gpus', None)
if not use_gpu:
@@ -85,9 +87,7 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,
if num_gpus_frm_yml is not None:
num_gpus = int(num_gpus_frm_yml)
usr_indicated_wait_time = config.get('deploy_timeout', 900)
usr_indicated_worker_port = config.get('worker_port', "")
if usr_indicated_worker_port == "":
usr_indicated_worker_port = os.environ.get("FEDML_WORKER_PORT", "")
usr_indicated_retry_cnt = max(int(usr_indicated_wait_time) // 10, 1)
shm_size = config.get('shm_size', None)
storage_opt = config.get('storage_opt', None)
tmpfs = config.get('tmpfs', None)
@@ -96,17 +96,6 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,
cpus = int(cpus)
memory = config.get('memory', None)

if usr_indicated_worker_port == "":
usr_indicated_worker_port = None
else:
usr_indicated_worker_port = int(usr_indicated_worker_port)

worker_port_env = os.environ.get("FEDML_WORKER_PORT", "")
worker_port_from_config = config.get('worker_port', "")
logging.info(f"usr_indicated_worker_port {usr_indicated_worker_port}, worker port env {worker_port_env}, "
f"worker port from config {worker_port_from_config}")

usr_indicated_retry_cnt = max(int(usr_indicated_wait_time) // 10, 1)
inference_image_name = config.get('inference_image_name',
ClientConstants.INFERENCE_SERVER_CUSTOME_IMAGE)
image_pull_policy = config.get('image_pull_policy', SchedulerConstants.IMAGE_PULL_POLICY_IF_NOT_PRESENT)
@@ -144,25 +133,15 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,

# If using customized image, then bootstrap + job will be the entry point
enable_custom_image = config.get("enable_custom_image", False)
# inference_type = "custom"
customized_image_entry_cmd = \
"/bin/bash /home/fedml/models_serving/fedml-deploy-bootstrap-entry-auto-gen.sh"

docker_registry_user_name = config.get("docker_registry_user_name", "")
docker_registry_user_password = config.get("docker_registry_user_password", "")
docker_registry = config.get("docker_registry", "")

port_inside_container = int(config.get("port_inside_container", 2345))
use_triton = config.get("use_triton", False)
if use_triton:
inference_type = "triton"
else:
inference_type = "default"

# Config check
if src_code_dir == "":
raise Exception("Please indicate source_code_dir in the fedml_model_config.yaml")
if relative_entry == "":
logging.warning("You missed main_entry in the fedml_model_config.yaml")
port_inside_container = int(config.get("port", 2345))

# Request the GPU ids for the deployment
if num_gpus > 0:
@@ -175,22 +154,10 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,
end_point_id, end_point_name, inference_model_name, edge_id, replica_rank+1, gpu_ids)
logging.info("GPU ids allocated: {}".format(gpu_ids))

# Create the model serving dir if not exists
model_serving_dir = ClientConstants.get_model_serving_dir()
if not os.path.exists(model_serving_dir):
os.makedirs(model_serving_dir, exist_ok=True)
converted_model_path = os.path.join(model_storage_local_path, ClientConstants.FEDML_CONVERTED_MODEL_DIR_NAME)
if os.path.exists(converted_model_path):
model_file_list = os.listdir(converted_model_path)
for model_file in model_file_list:
src_model_file = os.path.join(converted_model_path, model_file)
dst_model_file = os.path.join(model_serving_dir, model_file)
if os.path.isdir(src_model_file):
if not os.path.exists(dst_model_file):
shutil.copytree(src_model_file, dst_model_file, copy_function=shutil.copy,
ignore_dangling_symlinks=True)
else:
if not os.path.exists(dst_model_file):
shutil.copyfile(src_model_file, dst_model_file)

if inference_engine != ClientConstants.INFERENCE_ENGINE_TYPE_INT_DEFAULT:
raise Exception(f"inference engine {inference_engine} is not supported")
@@ -228,13 +195,12 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,
logging.info(f"Start pulling the inference image {inference_image_name}... with policy {image_pull_policy}")
ContainerUtils.get_instance().pull_image_with_policy(image_pull_policy, inference_image_name)

volumns = []
volumes = []
binds = {}
environment = {}

# data_cache_dir mounting
assert type(data_cache_dir_input) == dict or type(data_cache_dir_input) == str
if type(data_cache_dir_input) == str:
if isinstance(data_cache_dir_input, str):
# In this case, we mount to the same folder, if it has ~, we replace it with /home/fedml
src_data_cache_dir, dst_data_cache_dir = "", ""
if data_cache_dir_input != "":
@@ -253,28 +219,30 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,
if type(src_data_cache_dir) == str and src_data_cache_dir != "":
logging.info("Start copying the data cache to the container...")
if os.path.exists(src_data_cache_dir):
volumns.append(src_data_cache_dir)
volumes.append(src_data_cache_dir)
binds[src_data_cache_dir] = {
"bind": dst_data_cache_dir,
"mode": "rw"
}
environment["DATA_CACHE_FOLDER"] = dst_data_cache_dir
else:
elif isinstance(data_cache_dir_input, dict):
for k, v in data_cache_dir_input.items():
if os.path.exists(k):
volumns.append(v)
volumes.append(v)
binds[k] = {
"bind": v,
"mode": "rw"
}
else:
logging.warning(f"{k} does not exist, skip mounting it to the container")
logging.info(f"Data cache mount: {volumns}, {binds}")
logging.info(f"Data cache mount: {volumes}, {binds}")
else:
logging.warning("data_cache_dir_input is not a string or a dictionary, skip mounting it to the container")

# Default mounting
if not enable_custom_image or (enable_custom_image and relative_entry != ""):
logging.info("Start copying the source code to the container...")
volumns.append(src_code_dir)
volumes.append(src_code_dir)
binds[src_code_dir] = {
"bind": dst_model_serving_dir,
"mode": "rw"
@@ -284,7 +252,7 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,
host_config_dict = {
"binds": binds,
"port_bindings": {
port_inside_container: usr_indicated_worker_port
port_inside_container: None
},
"shm_size": shm_size,
"storage_opt": storage_opt,
@@ -312,7 +280,6 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,
if not enable_custom_image:
# For some image, the default user is root. Unified to fedml.
environment["HOME"] = "/home/fedml"

environment["BOOTSTRAP_DIR"] = dst_bootstrap_dir
environment["FEDML_CURRENT_RUN_ID"] = end_point_id
environment["FEDML_CURRENT_EDGE_ID"] = edge_id
@@ -326,12 +293,13 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,
for key in extra_envs:
environment[key] = extra_envs[key]

# Create the container
try:
host_config = client.api.create_host_config(**host_config_dict)
new_container = client.api.create_container(
image=inference_image_name,
name=default_server_container_name,
volumes=volumns,
volumes=volumes,
ports=[port_inside_container], # port open inside the container
environment=environment,
host_config=host_config,
@@ -349,22 +317,18 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,
while True:
cnt += 1
try:
if usr_indicated_worker_port is not None:
inference_http_port = usr_indicated_worker_port
break
else:
# Find the random port
port_info = client.api.port(new_container.get("Id"), port_inside_container)
inference_http_port = port_info[0]["HostPort"]
logging.info("inference_http_port: {}".format(inference_http_port))
break
# Find the random port
port_info = client.api.port(new_container.get("Id"), port_inside_container)
inference_http_port = port_info[0]["HostPort"]
logging.info("host port allocated: {}".format(inference_http_port))
break
except:
if cnt >= 5:
raise Exception("Failed to get the port allocation")
time.sleep(3)

# Logging the info from the container when starting
log_deployment_result(end_point_id, model_id, default_server_container_name,
log_deployment_output(end_point_id, model_id, default_server_container_name,
ClientConstants.CMD_TYPE_RUN_DEFAULT_SERVER,
inference_model_name, inference_engine, inference_http_port, inference_type,
retry_interval=10, deploy_attempt_threshold=usr_indicated_retry_cnt,
@@ -373,9 +337,8 @@ def start_deployment(end_point_id, end_point_name, model_id, model_version,

# Return the running model name and the inference output url
inference_output_url, running_model_version, ret_model_metadata, ret_model_config = \
get_model_info(inference_model_name, inference_engine, inference_http_port,
infer_host, False, inference_type, request_input_example=request_input_example,
enable_custom_image=enable_custom_image)
check_container_readiness(inference_http_port=inference_http_port, infer_host=infer_host,
request_input_example=request_input_example)

if inference_output_url == "":
return running_model_name, "", None, None, None
@@ -426,9 +389,8 @@ def should_exit_logs(end_point_id, model_id, cmd_type, model_name, inference_eng
# If the container has exited, return True, means we should exit the logs
try:
inference_output_url, model_version, model_metadata, model_config = \
get_model_info(model_name, inference_engine, inference_port, infer_host,
inference_type=inference_type, request_input_example=request_input_example,
enable_custom_image=enable_custom_image)
check_container_readiness(inference_http_port=inference_port, infer_host=infer_host,
request_input_example=request_input_example)
if inference_output_url != "":
logging.info("Log test for deploying model successfully, inference url: {}, "
"model metadata: {}, model config: {}".
@@ -443,7 +405,7 @@ def should_exit_logs(end_point_id, model_id, cmd_type, model_name, inference_eng
return False


def log_deployment_result(end_point_id, model_id, cmd_container_name, cmd_type,
def log_deployment_output(end_point_id, model_id, cmd_container_name, cmd_type,
inference_model_name, inference_engine,
inference_http_port, inference_type="default",
retry_interval=10, deploy_attempt_threshold=10,
@@ -542,10 +504,10 @@ def log_deployment_result(end_point_id, model_id, cmd_container_name, cmd_type,
time.sleep(retry_interval)


def is_client_inference_container_ready(infer_url_host, inference_http_port, inference_model_name, local_infer_url,
inference_type="default", model_version="", request_input_example=None):
def is_client_inference_container_ready(infer_url_host, inference_http_port, readiness_check_type="default",
readiness_check_cmd=None, request_input_example=None):

if inference_type == "default":
if readiness_check_type == "default":
default_client_container_ready_url = "http://{}:{}/ready".format("0.0.0.0", inference_http_port)
response = None
try:
@@ -555,59 +517,27 @@ def is_client_inference_container_ready(infer_url_host, inference_http_port, inf
if not response or response.status_code != 200:
return "", "", {}, {}

# Report the deployed model info
# Construct the model metadata (input and output)
model_metadata = {}
if request_input_example is not None and len(request_input_example) > 0:
model_metadata["inputs"] = request_input_example
else:
model_metadata["inputs"] = {"text": "What is a good cure for hiccups?"}
model_metadata["outputs"] = []
model_metadata["type"] = "default"

return "http://{}:{}/predict".format(infer_url_host, inference_http_port), None, model_metadata, None
else:
triton_server_url = "{}:{}".format(infer_url_host, inference_http_port)
if model_version == "" or model_version is None:
model_version = ClientConstants.INFERENCE_MODEL_VERSION
logging.info(
f"triton_server_url: {triton_server_url} model_version: {model_version} model_name: {inference_model_name}")
triton_client = http_client.InferenceServerClient(url=triton_server_url, verbose=False)
if not triton_client.is_model_ready(
model_name=inference_model_name, model_version=model_version
):
return "", model_version, {}, {}
logging.info(f"Model {inference_model_name} is ready, start to get model metadata...")
model_metadata = triton_client.get_model_metadata(model_name=inference_model_name, model_version=model_version)
model_config = triton_client.get_model_config(model_name=inference_model_name, model_version=model_version)
version_list = model_metadata.get("versions", None)
if version_list is not None and len(version_list) > 0:
model_version = version_list[0]
else:
model_version = ClientConstants.INFERENCE_MODEL_VERSION

inference_output_url = "http://{}:{}/{}/models/{}/versions/{}/infer".format(infer_url_host,
inference_http_port,
ClientConstants.INFERENCE_INFERENCE_SERVER_VERSION,
inference_model_name,
model_version)

return inference_output_url, model_version, model_metadata, model_config


def get_model_info(model_name, inference_engine, inference_http_port, infer_host="127.0.0.1", is_hg_model=False,
inference_type="default", request_input_example=None, enable_custom_image=False):
if model_name is None:
# TODO(Raphael): Support arbitrary readiness check command
logging.error(f"Unknown readiness check type: {readiness_check_type}")
return "", "", {}, {}

local_infer_url = "{}:{}".format(infer_host, inference_http_port)

if is_hg_model:
inference_model_name = "{}_{}_inference".format(model_name, str(inference_engine))
else:
inference_model_name = model_name

def check_container_readiness(inference_http_port, infer_host="127.0.0.1", request_input_example=None,
readiness_check_type="default", readiness_check_cmd=None):
response_from_client_container = is_client_inference_container_ready(
infer_host, inference_http_port, inference_model_name, local_infer_url,
inference_type, model_version="", request_input_example=request_input_example)
infer_host, inference_http_port, readiness_check_type, readiness_check_cmd,
request_input_example=request_input_example)

return response_from_client_container
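
A usage note on the rename: `get_model_info` becomes `check_container_readiness`, which simply forwards to `is_client_inference_container_ready`; for the `default` readiness type, that helper polls the container's `/ready` endpoint and, on HTTP 200, reports the `/predict` URL plus a default metadata dict. A rough standalone sketch of that probe, assuming the endpoints behave as shown in the diff (not the FedML implementation itself):

```python
# Sketch of the "default" readiness probe (assumed behavior, for illustration only).
import requests


def probe_default_readiness(infer_host, inference_http_port, request_input_example=None):
    ready_url = "http://{}:{}/ready".format("0.0.0.0", inference_http_port)
    try:
        response = requests.get(ready_url, timeout=5)
    except requests.RequestException:
        response = None
    if not response or response.status_code != 200:
        return "", "", {}, {}          # same "not ready" signature as in the diff

    # Construct the default model metadata (inputs and outputs), as the diff does.
    model_metadata = {
        "inputs": request_input_example or {"text": "What is a good cure for hiccups?"},
        "outputs": [],
        "type": "default",
    }
    inference_output_url = "http://{}:{}/predict".format(infer_host, inference_http_port)
    return inference_output_url, None, model_metadata, None
```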
