Efficient Resource Management for Deep Learning Applications with Virtual Containers

Wenjia Zheng, Fordham University


The explosion of data has transformed the world, since far more information is available for collection and analysis than ever before. To extract valuable information from data along different dimensions, various deep learning models have been developed in recent years. Although these models have demonstrated a strong capability to improve products and services in various applications, training them is still a time-consuming and resource-intensive process. At present, the cloud, one of the most powerful computing infrastructures, has been used for the training. However, managing cloud computing resources and performing the training efficiently remains a challenge for current techniques. For example, general resource scheduling approaches, such as spread priority and balanced resource schedulers, do not work well with deep learning workloads. Moreover, the resource allocation problem on a cluster can be divided into two subproblems: (1) local resource optimization: improve the resource configuration of a single machine; (2) global resource optimization: improve the cluster-wide resource allocation. In this thesis, we propose two novel container schedulers, FlowCon and SpeCon, which are designed to address these two subproblems, respectively, and specifically to optimize the performance of short-lived deep learning applications in the cloud. FlowCon focuses on the resource configuration of a single node in a cluster. We show that it efficiently improves the completion time of deep learning tasks and resource utilization, and reduces the completion time of a specific job by up to 42.06% without sacrificing the overall system time. SpeCon targets cluster-wide resource configuration and speculatively migrates slow-growing models to release resources for fast-growing ones. Based on our experiments, SpeCon improves makespan by up to 24.7% compared to current approaches.

Subject Area

Computer science|Information Technology

Recommended Citation

Zheng, Wenjia, "Efficient Resource Management for Deep Learning Applications with Virtual Containers" (2020). ETD Collection for Fordham University. AAI27960512.