# MPI communication layout
Spawning MPI processes at runtime is probably not supported on EVE. We thus need to find ways to connect and distribute pre-allocated MPI processes.
## Initialization
Using pre-allocated nodes requires that Finam itself is started on each of them. Based on the process/rank it runs on, each instance must then decide whether to run the normal Finam scheduler or a model's workers. It is unclear how this can be managed; ideally, the model wrappers would handle it themselves, without requiring prior knowledge from the user or the framework.
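The rank-based decision could be sketched as a pure function. All names here (`assign_role`, `model_sizes`) are hypothetical; in a real setup the rank would come from `MPI.COMM_WORLD` (e.g. via mpi4py), and the per-model process counts would come from the composition setup:

```python
# Hypothetical sketch: decide each pre-allocated process's role from its
# world rank. Ranks are handed out contiguously after the master (rank 0).

def assign_role(world_rank, model_sizes):
    """Return ("scheduler", None) for the master process (rank 0),
    or ("worker", model_name) for a model worker process.

    model_sizes: ordered mapping of model name -> number of assigned
    processes.
    """
    if world_rank == 0:
        return ("scheduler", None)  # rank 0 runs the normal Finam scheduler
    offset = 1  # first rank after the master
    for name, size in model_sizes.items():
        if world_rank < offset + size:
            return ("worker", name)
        offset += size
    raise ValueError(f"rank {world_rank} has no assigned model")


# Example matching the diagrams below: 1 master + 4 Formind + 3 OGS ranks
sizes = {"Formind": 4, "OGS": 3}
print(assign_role(0, sizes))  # ('scheduler', None)
print(assign_role(5, sizes))  # ('worker', 'OGS')
```

With such a convention, no extra configuration is needed: every process can derive its role from its rank alone.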
## Possible layouts
### Group rank 0 communicates with master
Each model is initialized with a communicator that connects all its assigned processes. The master (running Finam) is grouped with all model processes of rank 0; this communicator is also passed to all models. The models are responsible for communicating with the master to receive inputs and send outputs.
```mermaid
flowchart TB
    subgraph Formind
        F0 --- |B| F1
        F0 --- |B| F2
        F0 --- |B| F3
        F0 --- |B| F4
    end
    subgraph OGS
        O0 --- |C| O1
        O0 --- |C| O2
        O0 --- |C| O3
    end
    master --- |A| F0
    master --- |A| O0
```
Letters on connections denote communicators/groups/communication channels. Processes within each box can also communicate with each other; these connections are not shown.
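This layout could be realized with two `MPI_Comm_split` calls per process: one producing the per-model communicators (B, C) and one producing the master group (A). The following is a hypothetical sketch of the color assignment only; `None` stands in for `MPI_UNDEFINED` (rank excluded from that communicator), and the helper name is invented:

```python
# Hypothetical color assignment for two MPI_Comm_split calls realizing
# the "group rank 0 communicates with master" layout.
# None stands in for MPI_UNDEFINED (rank excluded from that communicator).

def split_colors(world_rank, model_sizes):
    """Return (model_color, master_color) for a given world rank.

    model_color:  index of the model group (B, C, ...); None for the
                  master, which is in no model communicator.
    master_color: 0 for the master and each model's local rank 0
                  (together forming communicator A); None otherwise.
    """
    if world_rank == 0:
        return (None, 0)  # master: only in communicator A
    offset = 1
    for color, size in enumerate(model_sizes.values()):
        if world_rank < offset + size:
            # The first rank of each model group also joins communicator A.
            return (color, 0 if world_rank == offset else None)
        offset += size
    raise ValueError(f"rank {world_rank} has no assigned model")


sizes = {"Formind": 4, "OGS": 3}
# F0 (world rank 1) is in the Formind group (color 0) and in A:
print(split_colors(1, sizes))  # (0, 0)
# F2 (world rank 3) is only in the Formind group:
print(split_colors(3, sizes))  # (0, None)
```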
### All processes communicate with master
Each model is initialized with a communicator for a group comprising its assigned processes and the master (as rank 0). Models are responsible for collecting their results on the master.
```mermaid
flowchart TB
    subgraph Formind
        F1
        F2
        F3
        F4
    end
    subgraph OGS
        O1
        O2
        O3
    end
    master --- |A| F1
    master --- |A| F2
    master --- |A| F3
    master --- |A| F4
    master --- |B| O1
    master --- |B| O2
    master --- |B| O3
```
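Since the master must be a member of every model group here, a single `MPI_Comm_split` cannot produce this layout; one split per model is needed, with the master participating in every call. Again a hypothetical sketch of the color/key assignment only (`None` stands in for `MPI_UNDEFINED`, and the helper name is invented):

```python
# Hypothetical color/key assignment for the "all processes communicate
# with master" layout: one MPI_Comm_split call per model, with the
# master taking part in every call and other ranks only in their own
# model's call. None stands in for MPI_UNDEFINED (rank excluded).

def layout_b_colors(world_rank, model_sizes):
    """Return {model_name: (color, key)} for one split call per model.

    The master uses key 0 so it becomes rank 0 of every model group;
    workers keep their world rank as key to preserve ordering.
    """
    result = {}
    offset = 1
    for name, size in model_sizes.items():
        if world_rank == 0:
            result[name] = (0, 0)  # master joins every group as rank 0
        elif offset <= world_rank < offset + size:
            result[name] = (0, world_rank)  # member of this model's group
        else:
            result[name] = (None, world_rank)  # excluded from this group
        offset += size
    return result


sizes = {"Formind": 4, "OGS": 3}
print(layout_b_colors(0, sizes))  # master: in both the Formind and OGS group
print(layout_b_colors(6, sizes))  # O2: only in the OGS group
```

A trade-off to note: this layout gives the master a direct channel to every worker, but also makes it a potential bottleneck, since all data collection runs through it.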