Type SlurmClusterResolver
Namespace tensorflow.distribute.cluster_resolver
Parent ClusterResolver
Interfaces ISlurmClusterResolver
ClusterResolver for systems with the Slurm workload manager. This is an implementation of ClusterResolver for Slurm clusters. It allows
the specification of jobs and task counts, the number of tasks per node, the number of
GPUs on each node, and the number of GPUs for each task. It retrieves system
attributes from Slurm environment variables, resolves the names of the allocated
compute nodes, constructs a cluster, and returns a ClusterResolver object which can be
used for distributed TensorFlow.
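As a rough illustration of what retrieving system attributes from Slurm environment variables can look like, here is a minimal Python sketch. It is not the resolver's own code. The helper name `read_slurm_env` is hypothetical, and the choice of `SLURM_PROCID` and `SLURM_NTASKS` is an assumption based on the standard variables Slurm exports to each task; only `SLURM_NTASKS_PER_NODE` is named on this page.

```python
import os


def read_slurm_env(environ=None):
    """Collect a few Slurm-provided attributes from an environment mapping.

    Hypothetical helper for illustration only; the variables read here are
    assumptions, not a statement of what SlurmClusterResolver reads.
    """
    if environ is None:
        environ = os.environ
    # SLURM_NTASKS_PER_NODE is only set when the job requested it.
    tasks_per_node = environ.get("SLURM_NTASKS_PER_NODE")
    return {
        "rank": int(environ["SLURM_PROCID"]),       # rank of the calling process
        "num_tasks": int(environ["SLURM_NTASKS"]),  # total tasks in the job
        "tasks_per_node": int(tasks_per_node) if tasks_per_node else None,
    }
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try outside a Slurm allocation.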
Methods
Properties
Public instance methods
ValueTuple<object, object> get_task_info()
Returns the job name and task_id for the process which calls this. This returns the job name and task index for the calling process
according to its rank and the cluster specification. The job name and
task index are set after a cluster is constructed by cluster_spec; otherwise
they default to None.
Returns
-
ValueTuple<object, object>
- A string specifying the job name the process belongs to and an integer specifying the task index of the process within that job.
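The mapping from a process rank to a job name and task index can be sketched as follows. This is an illustrative Python sketch, not the library's implementation: the function name `task_info_for_rank` is hypothetical, and it assumes that ranks are assigned to jobs in the order the jobs appear in the dictionary.

```python
def task_info_for_rank(jobs, rank):
    """Return (job_name, task_index) for a rank, or (None, None) if unmatched.

    `jobs` maps job names to task counts, e.g. {"ps": 1, "worker": 3}.
    Assumption: ranks fill the jobs consecutively in dictionary order.
    """
    offset = 0
    for job_name, num_tasks in jobs.items():
        if offset <= rank < offset + num_tasks:
            # The task index is the rank's position within its own job.
            return job_name, rank - offset
        offset += num_tasks
    return None, None
```

With `jobs = {"ps": 1, "worker": 3}`, rank 0 falls in the `ps` job and ranks 1 through 3 fall in the `worker` job with task indices 0 through 2.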
object get_task_info_dyn()
Returns the job name and task_id for the process which calls this. This returns the job name and task index for the calling process
according to its rank and the cluster specification. The job name and
task index are set after a cluster is constructed by cluster_spec; otherwise
they default to None.
Returns
-
object
- A string specifying the job name the process belongs to and an integer specifying the task index of the process within that job.
Public static methods
SlurmClusterResolver NewDyn(object jobs, ImplicitContainer<T> port_base, ImplicitContainer<T> gpus_per_node, ImplicitContainer<T> gpus_per_task, object tasks_per_node, ImplicitContainer<T> auto_set_gpu, ImplicitContainer<T> rpc_layer)
Creates a new SlurmClusterResolver object. This takes in parameters and creates a SlurmClusterResolver object. It uses
those parameters to determine which nodes the processes will reside on and resolves
their hostnames. Using the number of GPUs on each node and the number of GPUs
for each task, it offsets the port number for each process and allocates
GPUs to tasks by setting environment variables. The resolver currently
supports only homogeneous tasks and the default Slurm process allocation.
Parameters
-
object
jobs - Dictionary with job names as keys and the number of tasks in each job as values.
-
ImplicitContainer<T>
port_base - The first port number to start with for processes on a node.
-
ImplicitContainer<T>
gpus_per_node - Number of GPUs available on each node.
-
ImplicitContainer<T>
gpus_per_task - Number of GPUs to be used for each task.
-
object
tasks_per_node - Number of tasks to run on each node. If not set, defaults to Slurm's output environment variable SLURM_NTASKS_PER_NODE.
-
ImplicitContainer<T>
auto_set_gpu - Set the visible CUDA devices automatically while resolving the cluster by setting the CUDA_VISIBLE_DEVICES environment variable. Defaults to True.
-
ImplicitContainer<T>
rpc_layer - (Optional) The protocol TensorFlow uses to communicate between nodes. Defaults to 'grpc'.
Returns
-
SlurmClusterResolver
- A ClusterResolver object which can be used with distributed TensorFlow.
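To make the port offsetting and GPU allocation described for this method more concrete, here is a small Python sketch for a single node. It is a simplified illustration under stated assumptions, not the resolver's actual code: the function name `plan_node_tasks` is hypothetical, the port value in the usage note is an example rather than a documented default, and each task is assumed to receive a contiguous block of GPU IDs.

```python
def plan_node_tasks(host, tasks_on_node, port_base, gpus_per_node, gpus_per_task):
    """Return a list of (address, visible_gpus) pairs for the tasks on one node.

    Assumptions: tasks on a node use consecutive ports starting at port_base,
    and GPUs are handed out in contiguous blocks of gpus_per_task.
    """
    plan = []
    for local_index in range(tasks_on_node):
        first_gpu = local_index * gpus_per_task
        if first_gpu + gpus_per_task > gpus_per_node:
            raise ValueError("not enough GPUs on the node for the requested tasks")
        gpu_ids = range(first_gpu, first_gpu + gpus_per_task)
        # This string is what a process could export as CUDA_VISIBLE_DEVICES.
        visible_gpus = ",".join(str(gpu_id) for gpu_id in gpu_ids)
        plan.append((f"{host}:{port_base + local_index}", visible_gpus))
    return plan
```

For example, `plan_node_tasks("n1", 2, 8888, 4, 2)` places the first task at `n1:8888` with GPUs `0,1` and the second at `n1:8889` with GPUs `2,3`.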