Elastic Job Bundling: An Adaptive Resource Request Strategy for Large-Scale Parallel Applications
In today’s batch-queue HPC cluster systems, a user submits a job requesting a fixed number of processors, and the system does not start the job until all of the requested resources become available simultaneously. Under this policy, large jobs experience long waiting times when the cluster workload is high. In this paper, we propose a new approach that dynamically decomposes a large job into smaller subjobs to reduce waiting time and lets the application expand across multiple subjobs while continuously making progress. This approach has three benefits: (i) application turnaround time is reduced, (ii) system fragmentation is diminished, and (iii) fairness is promoted. Our approach does not depend on job queue time prediction; instead, it exploits available backfill opportunities. Simulation results show that our approach can reduce application mean turnaround time by up to 48%.
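The idea of letting an application expand across subjobs can be illustrated with a toy model. The sketch below is not the decomposition policy of this paper; it is a minimal illustration under simplifying assumptions: time is divided into discrete steps, the hypothetical trace `idle_per_step` gives the number of idle processors the scheduler could hand out at each step (beyond those the application already holds), and walltime limits, backfill windows, and subjob termination are ignored. The function name `elastic_schedule` is likewise an assumption made for illustration.

```python
def elastic_schedule(total_procs, idle_per_step):
    """Toy model of an elastic request (illustrative only).

    At every step the application submits a subjob sized to the
    processors that are idle right now, capped by what it still needs.
    Returns the number of processors held after each step.
    """
    if total_procs <= 0:
        raise ValueError("total_procs must be positive")

    held = 0          # processors the application currently holds
    history = []
    for idle in idle_per_step:
        # A new subjob takes whatever is idle, but never more than
        # the part of the original request that is still missing.
        grab = min(idle, total_procs - held)
        held += grab
        history.append(held)
    return history


if __name__ == "__main__":
    # Hypothetical trace: a request for 8 processors, with 2, 0, 3,
    # and 5 idle processors offered in four consecutive steps.
    print(elastic_schedule(8, [2, 0, 3, 5]))
```

In this trace the application begins computing on 2 processors at the first step and reaches its full size of 8 at the fourth, whereas a rigid request for 8 processors could not start until all 8 were free at the same time.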