Slurm MPI Plugin API

Overview

This document describes Slurm MPI selection plugins and the API that defines them. It is intended as a resource to programmers wishing to write their own Slurm MPI selection plugins.

Slurm MPI selection plugins are Slurm plugins that implement the API described herein and determine which version of MPI is used during execution of a new Slurm job. They are intended to provide a mechanism for both selecting MPI versions for pending jobs and performing any MPI-specific tasks for job launch or termination. The plugins must conform to the Slurm Plugin API with the following specifications:

const char plugin_type[]
The major type must be "mpi". The minor type can be any recognizable abbreviation for the MPI implementation. We recommend, for example:

  • pmi2 — For use with MPI2 and MVAPICH2.
  • pmix — Exascale PMI implementation (currently supported by OpenMPI starting from version 2.0)
  • none — For use with most other versions of MPI.

const char plugin_name[]
Some descriptive name for the plugin. There is no requirement with respect to its format.

const uint32_t plugin_version
If specified, identifies the version of Slurm used to build this plugin, and any attempt to load the plugin from a different version of Slurm will result in an error. If not specified, the plugin may be loaded by Slurm commands and daemons from any version; however, this may result in difficult-to-diagnose failures due to changes in the arguments to plugin functions or changes in other Slurm functions used by the plugin.
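
As a minimal sketch, the three identification symbols for a hypothetical plugin named "example" might be declared as follows. The plugin name, the "example" minor type, and the placeholder version value are assumptions made for illustration; a real plugin would take SLURM_VERSION_NUMBER from the Slurm headers rather than define it. The "major/minor" form of plugin_type follows the general Slurm plugin convention. The helper has_mpi_major_type() is not part of the API and only illustrates the naming rule.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: placeholder for the version macro that the Slurm headers
 * provide in a real build. The value 0 is not a real release number. */
#ifndef SLURM_VERSION_NUMBER
#define SLURM_VERSION_NUMBER 0
#endif

/* Hypothetical plugin "example": major type "mpi", minor type "example". */
const char plugin_name[] = "Example MPI plugin";
const char plugin_type[] = "mpi/example";
const uint32_t plugin_version = SLURM_VERSION_NUMBER;

/* Illustration only: returns 1 if the type string has the "mpi" major type. */
int has_mpi_major_type(const char *type)
{
	assert(type != NULL);
	return strncmp(type, "mpi/", 4) == 0;
}
```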

A simplified flow of logic follows:

  • srun is able to specify the correct mpi to use with --mpi=MPITYPE
  • srun command runs p_mpi_hook_client_prelaunch() which will set up the correct environment for the specified mpi.
  • slurmstepd process runs p_mpi_hook_slurmstepd_prefork() which will configure the slurmd to use the correct mpi as well as to interact with the srun.
  • slurmstepd process runs p_mpi_hook_slurmstepd_task() for each task, after the fork and immediately before the exec of that task.
  • srun command runs p_mpi_hook_client_fini() which executes after all tasks have finished.

Data Objects

These functions are expected to read and/or modify data structures directly in the memory of the slurmd daemon and of srun. Slurmd is a multi-threaded program with independent read and write locks on each data structure type. Therefore the type of operations permitted on various data structures is identified for each function.

Environment Variables

Slurm will set the following environment variables for plugins:

  • SLURM_MPI_TYPE — MPI plugin name that has been loaded for job.
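
A plugin or helper that wants to inspect this variable could read it with getenv(), as in the sketch below. The function names and the "none" fallback are assumptions chosen for illustration, not part of the Slurm API. The lookup is split from the selection logic so the latter can be exercised without touching the process environment.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustration only: choose the MPI type from an environment value,
 * falling back when the variable is unset or empty. */
const char *mpi_type_from(const char *env_value, const char *fallback)
{
	assert(fallback != NULL);
	if (env_value == NULL || env_value[0] == '\0')
		return fallback;
	return env_value;
}

/* Reads SLURM_MPI_TYPE from the current process environment.
 * Assumption: "none" is used here as the fallback name. */
const char *current_mpi_type(void)
{
	return mpi_type_from(getenv("SLURM_MPI_TYPE"), "none");
}
```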

API Functions

The following functions should be defined or at least be stubbed.

int init (void)

Description:
Called when the plugin is loaded or reloaded, before any other functions are called. Put global initialization here.

Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.

void fini (void)

Description:
Called when the plugin is removed or reloaded. Clear any allocated storage here.

Returns: None.

Note: These init and fini functions are not the same as those described in the dlopen (3) system library. The C run-time system co-opts those symbols for its own initialization. The system _init() is called before the Slurm init(), and the Slurm fini() is called before the system's _fini().
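
The following is a minimal sketch of an init()/fini() pair, assuming a hypothetical plugin that keeps one global scratch buffer. The buffer, its size, and the scratch_buffer() accessor are invented for illustration. SLURM_SUCCESS and SLURM_ERROR are defined locally only so the sketch compiles without the Slurm headers.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumption: in a real plugin these come from the Slurm headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* Hypothetical plugin-wide storage, for illustration only. */
#define SCRATCH_SIZE 4096
static char *scratch = NULL;

int init(void)
{
	/* init() can run again on reload, so drop any earlier allocation. */
	free(scratch);
	scratch = calloc(1, SCRATCH_SIZE);
	return scratch ? SLURM_SUCCESS : SLURM_ERROR;
}

void fini(void)
{
	/* Release storage; safe to call more than once. */
	free(scratch);
	scratch = NULL;
}

/* Illustration only: valid between a successful init() and fini(). */
char *scratch_buffer(void)
{
	assert(scratch != NULL);
	return scratch;
}
```

Resetting the pointer to NULL in fini() keeps a repeated fini() or a later init() from freeing the same block twice.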

mpi_plugin_client_state_t* p_mpi_hook_client_prelaunch (const mpi_plugin_client_info_t *job, char ***env)

Description: Called by srun to configure the job environment for the selected MPI implementation.

Arguments:
job    (input) Pointer to the description (mpi_plugin_client_info_t) of the job step that is about to run. Cannot be NULL.
env    (input/output) Pointer to pointer of job environment to allow plugin to modify job environment as needed.

Returns: Pointer to the plugin's launch state (mpi_plugin_client_state_t) if successful. On failure, the plugin should return NULL.

int p_mpi_hook_slurmstepd_prefork(const stepd_step_rec_t *job, char ***env)

Description: Called by slurmstepd before forking to create the first job process. Most of the real processing happens here. This is not called for batch job steps or extern steps.

Arguments:
job    (input) Pointer to the Job Step (stepd_step_rec_t) that is about to run. Cannot be NULL.
env    (input/output) Pointer to pointer of job environment to allow plugin to modify job environment as needed.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR.
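
The env argument is a pointer to the environment array itself, so the hook can grow or replace the array and the caller will see the result. The sketch below illustrates this under several assumptions: stepd_step_rec_t is replaced by a stand-in struct, the environment is a NULL-terminated array allocated with malloc, and env_append() and the variable name EXAMPLE_MPI_READY are made up for the example. A real plugin would more likely rely on Slurm's own environment helpers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: in a real plugin these come from the Slurm headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* Stand-in type so the sketch compiles without Slurm headers. */
typedef struct { int placeholder; } stepd_step_rec_t;

/* Illustration only: append "name=value" to a NULL-terminated,
 * heap-allocated array of heap-allocated strings. */
static int env_append(char ***env, const char *name, const char *value)
{
	size_t n = 0;
	char **grown;
	char *entry;

	if (*env)
		while ((*env)[n])
			n++;

	entry = malloc(strlen(name) + strlen(value) + 2);
	if (!entry)
		return SLURM_ERROR;
	strcpy(entry, name);
	strcat(entry, "=");
	strcat(entry, value);

	/* One slot for the new entry, one for the terminating NULL. */
	grown = realloc(*env, (n + 2) * sizeof(char *));
	if (!grown) {
		free(entry);
		return SLURM_ERROR;
	}
	grown[n] = entry;
	grown[n + 1] = NULL;
	*env = grown;	/* the caller sees the possibly moved array */
	return SLURM_SUCCESS;
}

int p_mpi_hook_slurmstepd_prefork(const stepd_step_rec_t *job, char ***env)
{
	assert(job != NULL);
	assert(env != NULL);
	/* EXAMPLE_MPI_READY is a made-up variable name. */
	return env_append(env, "EXAMPLE_MPI_READY", "1");
}
```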

void p_mpi_hook_slurmstepd_task(const mpi_plugin_task_info_t *job, char ***env)

Description: Called by the slurmstepd process immediately after fork and becoming the job user, but immediately prior to exec of the user task. This is not called for batch job steps or extern steps.

Arguments:
job    (input) Pointer to the description (mpi_plugin_task_info_t) of the task that is about to run. Cannot be NULL.
env    (input/output) Pointer to pointer of job environment to allow plugin to modify job environment as needed.

Returns: None.

int p_mpi_hook_client_fini(mpi_plugin_client_state_t *state)

Description: Called by srun after all tasks are complete. Cleans up anything that needs cleaning up after execution.

Arguments:
state    (input) Launch state of MPI. Currently, a typedef of void.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR, causing slurmctld to exit.
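
The sketch below shows one way the client-side state could flow from p_mpi_hook_client_prelaunch() to p_mpi_hook_client_fini(). It rests on several assumptions: mpi_plugin_client_info_t is replaced by a stand-in struct, struct example_state is a made-up private type, and a NULL return from the prelaunch hook is taken to signal failure. The typedef of mpi_plugin_client_state_t to void follows the description above.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumption: in a real plugin these come from the Slurm headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* Stand-in types so the sketch compiles without Slurm headers. */
typedef struct { int placeholder; } mpi_plugin_client_info_t;
typedef void mpi_plugin_client_state_t;

/* Hypothetical private state kept by the plugin on the srun side. */
struct example_state {
	int launched;
};

mpi_plugin_client_state_t *
p_mpi_hook_client_prelaunch(const mpi_plugin_client_info_t *job, char ***env)
{
	struct example_state *st;

	assert(job != NULL);
	(void) env;	/* the environment is left unchanged in this sketch */

	st = calloc(1, sizeof(*st));
	if (!st)
		return NULL;	/* assumption: NULL reports failure to srun */
	st->launched = 1;
	return st;
}

int p_mpi_hook_client_fini(mpi_plugin_client_state_t *state)
{
	if (!state)
		return SLURM_ERROR;
	free(state);	/* release what the prelaunch hook allocated */
	return SLURM_SUCCESS;
}
```

Keeping the state opaque lets each plugin store whatever it needs between launch and cleanup without exposing its layout to srun.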

Last modified 15 December 2018