SlurmClusterResolver - LostTech.TensorFlow Documentation

Type SlurmClusterResolver

Namespace tensorflow.distribute.cluster_resolver

ClusterResolver for systems with the Slurm workload manager.

This is an implementation of ClusterResolver for Slurm clusters. It allows the specification of jobs and task counts, the number of tasks per node, the number of GPUs on each node, and the number of GPUs for each task. It retrieves system attributes from Slurm environment variables, resolves the names of the allocated computing nodes, constructs a cluster, and returns a ClusterResolver object which can be used for distributed TensorFlow.
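
As a rough illustration of what retrieving system attributes from Slurm environment variables can look like, the sketch below reads two of them with the Python standard library. The helper name read_slurm_attributes is hypothetical, and the use of SLURM_PROCID as the process rank (with a fallback of 0 outside Slurm) is an assumption made for this example, not a statement about the resolver's internals.

```python
import os


def read_slurm_attributes(environ=None):
    """Read a few Slurm attributes from an environment mapping.

    Hypothetical helper for illustration only. `environ` defaults to
    os.environ, so the same code can be exercised outside a Slurm job.
    """
    env = os.environ if environ is None else environ
    # SLURM_NTASKS_PER_NODE is only present when the job requested it.
    tasks_per_node = env.get("SLURM_NTASKS_PER_NODE")
    return {
        # Assumption: fall back to rank 0 when not running under Slurm.
        "rank": int(env.get("SLURM_PROCID", 0)),
        "tasks_per_node": None if tasks_per_node is None else int(tasks_per_node),
    }


if __name__ == "__main__":
    print(read_slurm_attributes())
```

Passing an explicit dictionary instead of relying on os.environ keeps the helper easy to test on a machine without Slurm.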

Public instance methods

ValueTuple<object, object> get_task_info()

Returns job name and task_id for the process which calls this.

This returns the job name and task index for the process which calls this function, according to its rank and the cluster specification. The job name and task index are set after a cluster is constructed by cluster_spec; otherwise they default to None.

Returns

ValueTuple<object, object>: A string specifying the name of the job the process belongs to, and an integer specifying the index of the process's task within that job.
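
The mapping from a process rank to a job name and task index can be sketched as follows. This is a simplified illustration in plain Python, assuming that ranks are assigned to jobs consecutively in the order the jobs appear in the dictionary; the function name task_info_for_rank is hypothetical and is not part of the library.

```python
def task_info_for_rank(jobs, rank):
    """Return (job_name, task_index) for a global rank.

    `jobs` maps job names to task counts. Ranks are assumed to be handed
    out job by job, in dictionary order. Returns (None, None) when the
    rank falls outside every job, mirroring the documented None default.
    """
    offset = 0
    for job_name, num_tasks in jobs.items():
        if offset <= rank < offset + num_tasks:
            return job_name, rank - offset
        offset += num_tasks
    return None, None


if __name__ == "__main__":
    jobs = {"ps": 1, "worker": 3}
    for rank in range(5):
        print(rank, task_info_for_rank(jobs, rank))
```

With one "ps" task and three "worker" tasks, rank 0 maps to the "ps" job and ranks 1 to 3 map to "worker" tasks 0 to 2.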

object get_task_info_dyn()

Returns job name and task_id for the process which calls this.

Returns

object: A string specifying the name of the job the process belongs to, and an integer specifying the index of the process's task within that job.

Public static methods

SlurmClusterResolver NewDyn(object jobs, ImplicitContainer<T> port_base, ImplicitContainer<T> gpus_per_node, ImplicitContainer<T> gpus_per_task, object tasks_per_node, ImplicitContainer<T> auto_set_gpu, ImplicitContainer<T> rpc_layer)

Creates a new SlurmClusterResolver object.

This takes in parameters and creates a SlurmClusterResolver object. It uses those parameters to determine which nodes the processes will reside on and resolves their hostnames. Using the number of GPUs on each node and the number of GPUs for each task, it offsets the port number for each process and allocates GPUs to tasks by setting environment variables. The resolver currently supports homogeneous tasks and default Slurm process allocation.

Parameters

object jobs: Dictionary with job names as keys and the number of tasks in each job as values.
ImplicitContainer<T> port_base: The first port number to start with for processes on a node.
ImplicitContainer<T> gpus_per_node: Number of GPUs available on each node.
ImplicitContainer<T> gpus_per_task: Number of GPUs to be used for each task.
object tasks_per_node: Number of tasks to run on each node. If not set, defaults to Slurm's output environment variable SLURM_NTASKS_PER_NODE.
ImplicitContainer<T> auto_set_gpu: Set the visible CUDA devices automatically while resolving the cluster by setting the CUDA_VISIBLE_DEVICES environment variable. Defaults to True.
ImplicitContainer<T> rpc_layer: (Optional) The protocol TensorFlow uses to communicate between nodes. Defaults to 'grpc'.

Returns

SlurmClusterResolver: A ClusterResolver object which can be used with distributed TensorFlow.
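
The port offsetting and GPU allocation described for this method can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the library's implementation: the function name allocate_ports_and_gpus is hypothetical, every node is assumed to run the same number of tasks, and the sketch raises an error when the tasks on a node would need more GPUs than the node has, which the real resolver may handle differently.

```python
def allocate_ports_and_gpus(hosts, port_base, gpus_per_node, gpus_per_task,
                            tasks_per_node):
    """Return (addresses, gpu_allocation), one entry per task.

    addresses[i] is "host:port" for task i, and gpu_allocation[i] is a
    comma-separated GPU id string suitable for CUDA_VISIBLE_DEVICES.
    """
    if tasks_per_node * gpus_per_task > gpus_per_node:
        # Assumption of this sketch: refuse to oversubscribe GPUs.
        raise ValueError("not enough GPUs per node for the requested tasks")

    addresses = []
    gpu_allocation = []
    for host in hosts:
        for slot in range(tasks_per_node):
            # Each task on a node gets its own port, counted from port_base.
            addresses.append(f"{host}:{port_base + slot}")
            # Each task gets a contiguous, non-overlapping block of GPU ids.
            first_gpu = slot * gpus_per_task
            gpu_ids = range(first_gpu, first_gpu + gpus_per_task)
            gpu_allocation.append(",".join(str(g) for g in gpu_ids))
    return addresses, gpu_allocation


if __name__ == "__main__":
    # Illustrative values only: two nodes, four GPUs each, two tasks per node.
    addrs, gpus = allocate_ports_and_gpus(
        ["node1", "node2"], port_base=8888, gpus_per_node=4,
        gpus_per_task=2, tasks_per_node=2)
    for addr, gpu in zip(addrs, gpus):
        print(addr, gpu)
```

In this sketch, a process with a given rank would export the matching gpu_allocation entry as CUDA_VISIBLE_DEVICES, which corresponds to the behavior that auto_set_gpu controls.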

Public properties

string environment get;

object environment_dyn get;

object PythonObject get;

string rpc_layer get; set;

Nullable<int> task_id get; set;

object task_type get; set;