Slurm Burst Buffer Guide

Overview

Slurm includes support for burst buffers, a shared high-speed storage resource. Slurm provides support for allocating these resources, staging files in, scheduling compute nodes for jobs using these resources, then staging files out. Burst buffers can also be used as temporary storage during a job's lifetime, without file staging. Another typical use case is persistent storage that is not associated with any specific job. This support is provided using a plugin mechanism so that various burst buffer infrastructures may be easily configured. One plugin is currently provided:

  1. datawarp - Uses Cray APIs to perform underlying management functions

Additional plugins may be provided in future releases of Slurm.

Slurm's mode of operation follows this general pattern:

  1. Read configuration information and initial state information
  2. After expected start times for pending jobs are established, allocate burst buffers to those jobs expected to start earliest and start stage-in of required files
  3. After stage-in has completed, jobs can be allocated compute nodes and begin execution
  4. After job has completed execution, begin file stage-out from burst buffer
  5. After file stage-out has completed, burst buffer can be released and the job record purged

Configuration (for system administrators)

Burst buffer support in Slurm is enabled by specifying the plugin(s) to load for managing these resources using the BurstBufferType configuration parameter in the slurm.conf file. Multiple plugin names may be specified in a comma-separated list. Detailed logging of burst buffer specific actions may be generated for debugging purposes by using the DebugFlags=BurstBuffer configuration parameter. Like many other DebugFlags, BurstBuffer can produce very verbose logging and is intended for diagnostics only, not for routine use on a production system.

# Excerpt of example slurm.conf file
BurstBufferType=burst_buffer/datawarp
# DebugFlags=BurstBuffer # Commented out

Burst buffer specific options should be defined in a burst_buffer.conf file. This file can contain information about who can or cannot use burst buffers, timeouts, and paths of scripts used to perform various functions. TRES limits can be configured to establish limits by association, QOS, etc. The size of a job's burst buffer requirements can be used as a factor in setting the job priority, as described in the multifactor priority document.
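As a rough illustration of the kind of settings described above, a burst_buffer.conf excerpt might look like the sketch below. The parameter names mirror fields shown in the scontrol output later in this guide; the values (user name, pool name, timeouts) are placeholders chosen for illustration, and the burst_buffer.conf man page remains the authority on which parameters a given Slurm version accepts.

```
# Hypothetical excerpt of a burst_buffer.conf file (values are placeholders)
AllowUsers=alan          # only this user may use burst buffers
Flags=EnablePersistent   # let regular users create persistent buffers
DefaultPool=ssd          # pool used when a job does not name one
StageInTimeout=30        # seconds allowed for file stage-in
StageOutTimeout=30       # seconds allowed for file stage-out
```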

Note for Cray systems: The JSON-C library must be installed in order to build Slurm's burst_buffer/datawarp plugin, which must parse JSON format data. See Slurm's JSON installation information for details.

Job Submission Commands

The normal mode of operation is for batch jobs to specify burst buffer requirements within the batch script. When using the burst_buffer/generic plugin, batch script lines with a prefix of "#BB" identify the job's burst buffer space requirements, files to be staged in, files to be staged out, etc. When using the burst_buffer/datawarp plugin, the prefix "#DW" (for "DataWarp") is used for burst buffer directives. Please reference Cray documentation for details about the DataWarp options. On DataWarp systems, the "#BB" prefix can also be used to create or delete persistent burst buffer storage (NOTE: The "#BB" prefix is used because the command is interpreted by Slurm and not by the Cray DataWarp software).

Interactive jobs (those submitted using the salloc and srun commands) can specify their burst buffer space requirements using the "--bb" or "--bbf" command line options, as described later in this document.

All "#SBATCH", "#DW" and "#BB" directives should be placed at the top of the script, before any non-comment lines. All persistent burst buffer creations and deletions happen before the job's compute portion begins. Similarly, files cannot be staged in or out at arbitrary points during script execution: all file stage-in happens before the job's compute portion and all file stage-out happens after it.

The salloc and srun commands can create and use job-specific burst buffers. For both commands, the "--bb" or "--bbf" option is used to specify the job's burst buffer requirements. Note that the burst buffer may not be accessible from a login node; in that case salloc must spawn a shell on one of its allocated compute nodes.

Basic validation is performed on the job's burst buffer options at job submit time. If the options are invalid, the job will be rejected and an error message will be returned directly to the user. Note that unrecognized options may be ignored in order to support backward compatibility (i.e. a job submission would not fail if it specifies an option that is recognized by some versions of Slurm but not by others). If the job is accepted but later fails (e.g. some problem staging files), the job will be held and its "Reason" field will be set to the error message provided by the underlying infrastructure.

Users may also request to be notified by email upon completion of burst buffer stage out using the "--mail-type=stage_out" or "--mail-type=all" option. The subject line of the email will be of this form:

SLURM Job_id=12 Name=my_app Staged Out, StageOut time 00:05:07

Persistent Burst Buffer Creation and Deletion Directives

These options are used for both the burst_buffer/datawarp and burst_buffer/generic plugins to create and delete persistent burst buffers, which have a lifetime independent of the job.

  • #BB create_persistent name=<name> capacity=<number> [access=<access>] [pool=<pool>] [type=<type>]
  • #BB destroy_persistent name=<name> [hurry]

The persistent burst buffer name may not start with a numeric value (numeric names are reserved for job-specific burst buffers). The capacity (size) specification can include a suffix of "N" (nodes), "K|KiB", "M|MiB", "G|GiB", "T|TiB", "P|PiB" (for powers of 1024) and "KB", "MB", "GB", "TB", "PB" (for powers of 1000). NOTE: Usually Slurm interprets KB, MB, GB, TB, PB units as powers of 1024, but for burst buffer size specifications Slurm supports both IEC and SI formats. This is because the Cray API supports both formats.

The access parameter identifies the buffer access mode. Supported access modes for the burst_buffer/datawarp plugin include: striped, private, and ldbalance. The pool parameter identifies the resource pool from which the burst buffer should be created. The default and available pools are configuration dependent. The type parameter identifies the buffer type. Supported type modes for the burst_buffer/datawarp plugin include: cache and scratch.

Multiple persistent burst buffers may be created or deleted within a single job. A sample batch script follows:

#!/bin/bash
#BB create_persistent name=alpha capacity=32GB access=striped type=scratch
#DW jobdw type=scratch capacity=1GB access_mode=striped
#DW stage_in  type=file source=/home/alan/data.in  destination=$DW_JOB_STRIPED/data
#DW stage_out type=file destination=/home/alan/data.out source=$DW_JOB_STRIPED/data
/home/alan/a.out
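The difference between the two suffix families described above is easy to misjudge, so the following sketch converts a capacity string to bytes. The function name bb_size_to_bytes is hypothetical and is not part of Slurm; treating a bare number as bytes is also an assumption made only for this illustration.

```shell
#!/bin/bash
# Illustrative helper, not part of Slurm: convert a size such as "1G",
# "1GiB" or "32GB" to bytes using the suffix rules described above.
bb_size_to_bytes() {
    local spec="$1" number suffix factor
    number="${spec%%[!0-9]*}"     # leading digits, e.g. "32" from "32GB"
    suffix="${spec#"$number"}"    # remaining text, e.g. "GB"
    if [ -z "$number" ]; then
        echo "invalid size: $spec" >&2
        return 1
    fi
    case "$suffix" in
        "")     factor=1 ;;       # assumption: a bare number means bytes
        K|KiB)  factor=1024 ;;
        M|MiB)  factor=1048576 ;;
        G|GiB)  factor=1073741824 ;;
        T|TiB)  factor=1099511627776 ;;
        P|PiB)  factor=1125899906842624 ;;
        KB)     factor=1000 ;;
        MB)     factor=1000000 ;;
        GB)     factor=1000000000 ;;
        TB)     factor=1000000000000 ;;
        PB)     factor=1000000000000000 ;;
        *)      # includes "N" (nodes), which has no fixed size in bytes
                echo "cannot convert suffix '$suffix' to bytes" >&2
                return 1 ;;
    esac
    echo $(( number * factor ))
}

bb_size_to_bytes 1GiB    # prints 1073741824 (powers of 1024)
bb_size_to_bytes 32GB    # prints 32000000000 (powers of 1000)
```

Under these rules the 32GB persistent buffer in the sample script is slightly smaller than a 32GiB request would be.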

Persistent burst buffers can be created and deleted by a job requiring no compute resources. Submit a job with the desired burst buffer directives and specify a node count of zero (e.g. "sbatch -N0 setup_buffers.bash"). Attempts to submit a zero size job without burst buffer directives or with job-specific burst buffer directives will generate an error. Note that zero size jobs are not supported for job arrays or heterogeneous job allocations.
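A minimal sketch of such zero size jobs follows. It only writes two small batch scripts; the file names and the buffer name "alpha" are placeholders, and the sbatch commands are shown as comments because they need a running Slurm system.

```shell
#!/bin/bash
# Batch script that creates a persistent burst buffer and does nothing else.
cat > create_alpha.bash <<'EOF'
#!/bin/bash
#BB create_persistent name=alpha capacity=32GB access=striped type=scratch
EOF

# Matching batch script that removes the same buffer.
cat > destroy_alpha.bash <<'EOF'
#!/bin/bash
#BB destroy_persistent name=alpha
EOF

# On a Slurm system, submit each script with a node count of zero:
#   sbatch -N0 create_alpha.bash
#   sbatch -N0 destroy_alpha.bash
```

Keeping creation and destruction in separate scripts lets the buffer outlive any number of compute jobs submitted in between.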

NOTE: The ability to create and destroy persistent burst buffers may be limited by the "Flags" option in the burst_buffer.conf file. See the burst_buffer.conf man page for more information. By default only privileged users (i.e. Slurm operators and administrators) can create or destroy persistent burst buffers.

Interactive Job Options

Interactive jobs may include directives for creating job-specific burst buffers as well as file staging. These options may be specified using either the "--bb" or "--bbf" option of the salloc or srun command. Note that support for creation and destruction of persistent burst buffers using the "--bb" option is not provided. The "--bbf" option takes a filename as its argument, and that file should contain a collection of burst buffer operations identical to those used for batch jobs. This file may contain file staging directives. Alternatively, the "--bb" option may be used to specify burst buffer directives as the option argument. The format of those directives can either be identical to those used in a batch script OR a very limited set of directives can be used, which are translated to the equivalent script for later processing. Multiple directives should be space separated.

  • access=<access>
  • capacity=<number>
  • swap=<number>
  • type=<type>
  • pool=<name>

If a swap option is specified, the job must also specify the required node count. The capacity (size) specification can include a suffix of "N" (nodes), "K|KiB", "M|MiB", "G|GiB", "T|TiB", "P|PiB" (for powers of 1024) and "KB", "MB", "GB", "TB", "PB" (for powers of 1000). NOTE: Usually Slurm interprets KB, MB, GB, TB, PB units as powers of 1024, but for burst buffer size specifications Slurm supports both IEC and SI formats. This is because the Cray API supports both formats. A sample command line follows, together with the equivalent burst buffer script generated by the options:

# Sample execute line:
srun --bb="capacity=1G access=striped type=scratch" a.out

# Equivalent script as generated by Slurm's burst_buffer/datawarp plugin
#DW jobdw capacity=1GiB access_mode=striped type=scratch
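For the "--bbf" form, the directives live in a file instead of on the command line. The sketch below writes such a file; the name job_bb.txt is a placeholder, and the directives are copied from the batch script example earlier in this guide. The srun command is shown as a comment because it needs a running Slurm system.

```shell
#!/bin/bash
# Write the burst buffer directives to a file. The quoted 'EOF' keeps the
# shell from expanding $DW_JOB_STRIPED, which is meant for DataWarp.
cat > job_bb.txt <<'EOF'
#DW jobdw type=scratch capacity=1GB access_mode=striped
#DW stage_in type=file source=/home/alan/data.in destination=$DW_JOB_STRIPED/data
#DW stage_out type=file destination=/home/alan/data.out source=$DW_JOB_STRIPED/data
EOF

# On a Slurm system the file would then be passed to srun or salloc:
#   srun --bbf=job_bb.txt a.out
```

Unlike the limited "--bb" shorthand, a file passed with "--bbf" can carry file staging directives.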

Symbol Replacement

Slurm supports a number of symbols that can be used to automatically fill in certain job details, e.g., to make stage_in or stage_out directory paths vary with each job submission.

Supported symbols include:

%%    %
%A    Array Master Job Id
%a    Array Task Id
%d    Workdir
%j    Job Id
%u    User Name
%x    Job Name
\\    Stop further processing of the line
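To make the effect of these symbols concrete, the sketch below mimics the substitution for a small subset (%%, %j, %u and %x) in plain shell. It illustrates the expected result only and is not how Slurm implements it; the function name expand_bb_symbols and the sample values are hypothetical, and the array, workdir and "\\" symbols are deliberately left out.

```shell
#!/bin/bash
# Illustrative only: expand %%, %j, %u and %x in a path.
# Usage: expand_bb_symbols TEXT JOB_ID USER_NAME JOB_NAME
expand_bb_symbols() {
    local rest="$1" job_id="$2" user="$3" job_name="$4"
    local out="" tail
    while [ -n "$rest" ]; do
        case "$rest" in
            %%*) out="$out%";         rest="${rest#"%%"}" ;;
            %j*) out="$out$job_id";   rest="${rest#"%j"}" ;;
            %u*) out="$out$user";     rest="${rest#"%u"}" ;;
            %x*) out="$out$job_name"; rest="${rest#"%x"}" ;;
            *)   # copy one ordinary character through unchanged
                 tail="${rest#?}"
                 out="$out${rest%"$tail"}"
                 rest="$tail" ;;
        esac
    done
    printf '%s\n' "$out"
}

expand_bb_symbols '/home/%u/results/%j/%x.out' 42 alan my_app
# prints /home/alan/results/42/my_app.out
```

A stage_out destination written this way would place each job's output in its own directory without editing the script between submissions.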

Status Commands

Slurm's current burst buffer state information is available using the scontrol show burst command or by using the sview command's Burst Buffer tab. A sample scontrol output is shown below. The scontrol "-v" option may be used for a more verbose output format.

$ scontrol show burst
Name=generic DefaultPool=ssd Granularity=100G TotalSpace=50T UsedSpace=42T
  StageInTimeout=30 StageOutTimeout=30 Flags=EnablePersistent,PrivateData
  AllowUsers=alan,brenda
  CreateBuffer=/usr/local/slurm/17.11/sbin/CB
  DestroyBuffer=/usr/local/slurm/17.11/sbin/DB
  GetSysState=/usr/local/slurm/17.11/sbin/GSS
  StartStageIn=/usr/local/slurm/17.11/sbin/SSI
  StartStageOut=/usr/local/slurm/17.11/sbin/SSO
  StopStageIn=/usr/local/slurm/17.11/sbin/PSI
  StopStageOut=/usr/local/slurm/17.11/sbin/PSO
  Allocated Buffers:
    JobID=18 CreateTime=2017-08-19T16:46:05 Pool=dwcache Size=10T State=allocated UserID=alan(1000)
    JobID=20 CreateTime=2017-08-19T16:46:45 Pool=dwcache Size=10T State=allocated UserID=alan(1000)
    Name=DB1 CreateTime=2017-08-19T16:46:45 Pool=dwcache Size=22T State=allocated UserID=brenda(1001)
  Per User Buffer Use:
    UserID=alan(1000) Used=20T
    UserID=brenda(1001) Used=22T
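When this output is consumed by a script, the "Per User Buffer Use" lines can be picked out with a small filter. The sketch below assumes the line layout shown in the sample above, which may differ between Slurm versions; the function name bb_usage_by_user is hypothetical.

```shell
#!/bin/bash
# Read "scontrol show burst" text on stdin and print "<user> <used>" for
# each line of the form "UserID=name(uid) Used=size".
bb_usage_by_user() {
    sed -n 's/^ *UserID=\([^(]*\)([0-9]*) Used=\(.*\)$/\1 \2/p'
}

# On a Slurm system:
#   scontrol show burst | bb_usage_by_user
```

Lines under "Allocated Buffers" are skipped because they do not begin with UserID and carry no Used field.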

Access to the Cray burst buffer status tool, dwstat, is available from the scontrol command using the "scontrol show bbstat ..." or "scontrol show dwstat ..." command. Options following "bbstat" or "dwstat" on the scontrol execute line are passed directly to the dwstat command as shown below. See Cray DataWarp documentation for details about dwstat options and output.

/opt/cray/dws/default/bin/dwstat
$ scontrol show dwstat
    pool units quantity    free gran
wlm_pool bytes  7.28TiB 7.28TiB 1GiB

$ scontrol show dwstat sessions
 sess state      token creator owner             created expiration nodes
  832 CA---  783000000  tester 12345 2015-09-08T16:20:36      never    20
  833 CA---  784100000  tester 12345 2015-09-08T16:21:36      never     1
  903 D---- 1875700000  tester 12345 2015-09-08T17:26:05      never     0

$ scontrol show dwstat configurations
 conf state inst    type access_type activs
  715 CA---  753 scratch      stripe      1
  716 CA---  754 scratch      stripe      1
  759 D--T-  807 scratch      stripe      0
  760 CA---  808 scratch      stripe      1

Advanced Reservations

Burst buffer resources can be placed in an advanced reservation using the BurstBuffer option. The argument consists of four elements:
[plugin:][pool:]#[units]

plugin is the burst buffer plugin name, currently either "datawarp" or "generic". If no plugin is specified, the reservation applies to all configured burst buffer plugins.

pool specifies a Cray generic burst buffer resource pool. If "type" is not specified, the number is a measure of storage space.

units may be "N" (nodes), "K|KiB", "M|MiB", "G|GiB", "T|TiB", "P|PiB" (for powers of 1024) and "KB", "MB", "GB", "TB", "PB" (for powers of 1000). The default units are bytes for reservations of storage space. NOTE: Usually Slurm interprets KB, MB, GB, TB, PB units as powers of 1024, but for burst buffer size specifications Slurm supports both IEC and SI formats. This is because the Cray DataWarp API supports both formats.

Jobs using this reservation are not restricted to these burst buffer resources, but may use these reserved resources plus any which are generally available. Some examples follow.

$ scontrol create reservation starttime=now duration=60 \
  users=alan flags=any_nodes \
  burstbuffer=datawarp:100G,generic:20G

$ scontrol create reservation StartTime=noon duration=60 \
  users=brenda NodeCnt=8 \
  BurstBuffer=datawarp:20G

$ scontrol create reservation StartTime=16:00 duration=60 \
  users=joseph flags=any_nodes \
  BurstBuffer=datawarp:pool_test:4G

Job Dependencies

If two jobs use burst buffers and one is dependent on the other (e.g. "sbatch --dependency=afterok:123 ...") then the second job will not begin until the first job completes and its burst buffer stage out completes. If the second job does not use a burst buffer, but is dependent upon the first job's completion, then it will not wait for the stage out operation of the first job to complete. The second job can be made to wait for the first job's stage out operation to complete using the "afterburstbuffer" dependency option (e.g. "sbatch --dependency=afterburstbuffer:123 ...").
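The dependency can be wired up without copying job IDs by hand. The sketch below wraps two sbatch calls in a function; the function name submit_after_stage_out and the script names are hypothetical, and the function is only defined here, since calling it requires a running Slurm system.

```shell
#!/bin/bash
# Submit two batch scripts so that the second starts only after the first
# job's burst buffer stage-out has finished.
submit_after_stage_out() {
    local first_script="$1" second_script="$2" first_id
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
    first_id="$(sbatch --parsable "$first_script")" || return 1
    first_id="${first_id%%;*}"
    sbatch --dependency="afterburstbuffer:${first_id}" "$second_script"
}

# Example call on a Slurm system:
#   submit_after_stage_out produce_data.bash analyze_data.bash
```

This form matters most when the second job uses no burst buffer itself, since an ordinary completion dependency would not wait for the first job's stage-out.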

Last modified 15 October 2020