Parallel Scheduling of Flux Calculations
FACETS is designed to have multiple flux models, and each flux model will run only on specified flux surfaces in the core. Flux models are expected to be flexible enough to run multiple flux surfaces on a single processor or a single flux surface on several processors. This leaves us with a challenge about how to run the flux models in parallel in the optimal amount of time.
General Problem Description
The problem of allocating processors (more precisely MPI ranks) to flux models can be generalized into the following general parallel resource scheduling problem.
Given:
- n - MPI ranks (may correspond to cores or nodes)
- p - parallel tasks
- Wi - a measure linearly proportional to the computational work required to complete parallel task i for i greater than or equal to 1 and less than or equal to p. This is a real valued parameter, and Wi = 2 Wj implies that task i takes twice as long as task j when each is allocated the same number of MPI ranks.
- Li - an integer lower bound on the number of MPI ranks that parallel task i (i in 1..p) requires (often 1).
- Ui - an integer upper bound on the number of MPI ranks that parallel task i (i in 1..p) can effectively utilize
- Di - the number of MPI ranks assigned to parallel task i must by integer divisible by Di. In many cases, Di is 1 which means any number of ranks satisfying the minimum and maximum constraint is legal. A task requiring an even number of MPI ranks would have Di of 2.
It is assumed that each FACETS flux components can specify values for L, U, and D.
In thinking about the problem, it seems that we also need to have some indication of the parallel efficiency of the parallel tasks. If all parallel tasks are perfectly scalable, have n integer divisible by Di, and have Ui greater than or equal to n, one optimal solution is to give the all processors to each flux calculation serially (i.e., task 1 runs on n MPI ranks, followed by task 2 running on n MPI ranks, etc.).
The simplest thing to do initially is to assume linear speed up with a constant efficiency, E, that's less than 1. If we take Ri as the number of MPI ranks assigned to parallel task i, the time required to complete task i, Ti, equals Wi / ( E * Ri' ).
The objective is to find the optimal processor allocation and computing schedule to minimize the time required to complete all p parallel tasks.
Thoughts and Observations About the Solution of the Parallel Allocation/Scheduling Problem
I haven't proven it, but this problem seems more difficult than other known NP-Hard problems (e.g., the bin packing problem). Hence, I would say that an algorithm to produce a globally optimum solution is not in the cards.
Even a relatively simple problem involving 4 tasks and 4 MPI ranks can have vastly different optimal solutions depending on the values of Wi. If the values of W are relatively similar, the optimal solution is assigning one processor per task.
However, if W1 is 12 and the other W values are around 1, assigning one rank per task gives a highly non-optimal result.
One optimal solution is to run task 1 in parallel on processors 1-3 while running tasks 2-4 on processor 4 in serial.
I am guessing that the number of tasks is going to be in the tens, and the number of processors could be in the tens of thousands. With these kinds of numbers, finding the global optimal will most likely take longer than we would like to take.
Attachments
-
LoadBalance1.png
(2.3 KB) - added by tepperly
2 years ago.
-
LoadBalance.svg
(4.1 KB) - added by tepperly
2 years ago.
-
LoadBalance1.2.png
(7.3 KB) - added by tepperly
2 years ago.
-
LoadBalance1.svg
(6.4 KB) - added by tepperly
2 years ago.
-
LoadBalance2.svg
(7.1 KB) - added by tepperly
2 years ago.
-
LoadBalance2.png
(6.4 KB) - added by tepperly
2 years ago.
-
LoadBalance3.png
(6.5 KB) - added by tepperly
2 years ago.
-
LoadBalance3.svg
(8.1 KB) - added by tepperly
2 years ago.



