Dynamic Resource Management Schemes for Containerized Deep Learning Applications
The increasing demand for learning from massive datasets is restructuring our economy. Effective learning, however, requires nontrivial computing resources. Most businesses use commercial infrastructure providers (e.g., AWS) to host their computing clusters in the cloud, where diverse jobs compete for the available resources. While cloud resource management is a fruitful research field that has produced many advances in production systems, such as Kubernetes and YARN, few efforts have been invested in further optimizing system performance, especially for deep learning (DL) training jobs in a container cluster. This work introduces FlowCon 2.0, an advanced version of FlowCon 1.0: a system that monitors the individual evaluation functions of DL jobs at runtime and thereby makes elastic placement and resource-allocation decisions. We present a detailed design and implementation of FlowCon 2.0 and conduct intensive experiments over various DL models. The results demonstrate that FlowCon 2.0 significantly improves DL job completion time and resource utilization efficiency compared to default systems. Overall, FlowCon 2.0 improves completion time by up to 68.8% while reducing makespan by 18.0% in the presence of various DL job workloads. For instance, the makespan reduction on an 8-worker heterogeneous cluster (with a 60-job workload) is larger than on the homogeneous clusters. In the 4-worker, 20-job setting, FlowCon 2.0 reduces the completion time of 18 out of 20 jobs and achieves the largest completion-time and makespan reductions. These results indicate that FlowCon 2.0 remains efficient even in multi-cluster settings under high workloads.
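The core idea, monitoring each DL job's evaluation function at runtime and elastically shifting resources toward jobs that are still improving, can be sketched as follows. This is an illustrative assumption of how such a scheduler might score and allocate, not FlowCon's actual implementation; the names `growth_score` and `allocate_shares` and the relative-loss-improvement formula are hypothetical.

```python
# Hedged sketch of growth-efficiency-based resource allocation: jobs whose
# training loss is improving fastest receive a larger share of the cluster's
# resources. All names and formulas here are illustrative assumptions.

def growth_score(loss_history, window=3):
    """Relative loss improvement over the last `window` recorded epochs."""
    if len(loss_history) < window + 1:
        return 1.0  # too little history: treat the job as fully promising
    old, new = loss_history[-window - 1], loss_history[-1]
    if old <= 0:
        return 0.0
    # Fraction of loss removed in the window; clamp regressions to zero.
    return max(0.0, (old - new) / old)

def allocate_shares(jobs, total_share=1.0):
    """Split `total_share` of resources proportionally to growth scores.

    `jobs` maps a job name to its recorded loss history (most recent last).
    """
    scores = {name: growth_score(h) for name, h in jobs.items()}
    total = sum(scores.values())
    if total == 0:
        # No job is improving: fall back to an even split.
        return {name: total_share / len(jobs) for name in jobs}
    return {name: total_share * s / total for name, s in scores.items()}
```

In such a scheme, a job whose loss has plateaued is gradually starved in favor of jobs still making progress, which is one intuitive way runtime monitoring can shorten both per-job completion time and overall makespan.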
Computer science|Information science
Sharma, Vaishali, "Dynamic Resource Management Schemes for Containerized Deep Learning Applications" (2022). ETD Collection for Fordham University. AAI28965271.