LostTech.TensorFlow : API Documentation

Type SlurmClusterResolver

Namespace tensorflow.distribute.cluster_resolver

Parent ClusterResolver

Interfaces ISlurmClusterResolver

ClusterResolver for system with Slurm workload manager.

This is an implementation of ClusterResolver for Slurm clusters. It lets you specify jobs and their task counts, the number of tasks per node, the number of GPUs on each node, and the number of GPUs for each task. It retrieves system attributes from Slurm environment variables, resolves the names of the allocated compute nodes, constructs a cluster, and returns a ClusterResolver object which can be used for distributed TensorFlow.
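As a rough illustration of how system attributes can be read from Slurm environment variables, here is a minimal Python sketch (Python being the language of the TensorFlow API these bindings wrap). Only SLURM_NTASKS_PER_NODE is named in this documentation; the use of SLURM_PROCID, SLURM_NTASKS and SLURM_JOB_NODELIST, and the helper name read_slurm_attributes, are assumptions made for illustration and may not match what the resolver actually reads.

```python
import os


def read_slurm_attributes(environ=os.environ):
    """Collect Slurm-provided values from an environment mapping.

    Hypothetical helper: the set of variables is an assumption, not the
    resolver's confirmed behavior. Unset variables are reported as None.
    """
    def as_int(name):
        value = environ.get(name)
        return int(value) if value is not None else None

    return {
        "rank": as_int("SLURM_PROCID"),                   # global rank of this process
        "num_tasks": as_int("SLURM_NTASKS"),              # total number of tasks in the job
        "tasks_per_node": as_int("SLURM_NTASKS_PER_NODE"),
        "nodelist": environ.get("SLURM_JOB_NODELIST"),    # compact list of allocated nodes
    }
```

Passing the environment as a parameter keeps the sketch testable outside a Slurm job; inside a job, calling read_slurm_attributes() with no arguments reads the real process environment.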

Public instance methods

ValueTuple<object, object> get_task_info()

Returns job name and task_id for the process which calls this.

This returns the job name and task index for the process which calls this function, according to its rank and the cluster specification. The job name and task index are set after a cluster is constructed by cluster_spec; otherwise they default to None.
Returns
ValueTuple<object, object>
A string specifying the name of the job the process belongs to, and an integer specifying the task index of the process within that job.
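The mapping from a process rank to a job name and task index can be sketched as follows. This is a minimal Python illustration, not the library's implementation: it assumes that jobs occupy consecutive rank ranges in sorted job-name order, and the function name resolve_task_info is hypothetical.

```python
def resolve_task_info(jobs, rank):
    """Map a global process rank to (job_name, task_index).

    Assumption: each job takes a contiguous block of ranks, with jobs
    ordered by name. Returns (None, None) if the rank fits no job.
    """
    offset = 0
    for job_name, num_tasks in sorted(jobs.items()):
        if offset <= rank < offset + num_tasks:
            return job_name, rank - offset
        offset += num_tasks
    return None, None


jobs = {"ps": 1, "worker": 3}
print(resolve_task_info(jobs, 0))  # ('ps', 0)
print(resolve_task_info(jobs, 2))  # ('worker', 1)
```

Under this assumed layout, rank 0 is the single "ps" task and ranks 1 to 3 are the "worker" tasks with indices 0 to 2.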

object get_task_info_dyn()

Returns job name and task_id for the process which calls this.

This returns the job name and task index for the process which calls this function, according to its rank and the cluster specification. The job name and task index are set after a cluster is constructed by cluster_spec; otherwise they default to None.
Returns
object
A string specifying the name of the job the process belongs to, and an integer specifying the task index of the process within that job.

Public static methods

SlurmClusterResolver NewDyn(object jobs, ImplicitContainer<T> port_base, ImplicitContainer<T> gpus_per_node, ImplicitContainer<T> gpus_per_task, object tasks_per_node, ImplicitContainer<T> auto_set_gpu, ImplicitContainer<T> rpc_layer)

Creates a new SlurmClusterResolver object.

This takes in parameters and creates a SlurmClusterResolver object. It uses those parameters to determine which nodes the processes will reside on and resolves their hostnames. Using the number of GPUs on each node and the number of GPUs for each task, it offsets the port number for each process and allocates GPUs to tasks by setting environment variables. The resolver currently supports homogeneous tasks and the default Slurm process allocation.
Parameters
object jobs
Dictionary with job names as keys and the number of tasks in each job as values.
ImplicitContainer<T> port_base
The first port number to start with for processes on a node.
ImplicitContainer<T> gpus_per_node
Number of GPUs available on each node.
ImplicitContainer<T> gpus_per_task
Number of GPUs to be used for each task.
object tasks_per_node
Number of tasks to run on each node. If not set, defaults to the value of Slurm's output environment variable SLURM_NTASKS_PER_NODE.
ImplicitContainer<T> auto_set_gpu
Sets the visible CUDA devices automatically while resolving the cluster by setting the CUDA_VISIBLE_DEVICES environment variable. Defaults to True.
ImplicitContainer<T> rpc_layer
(Optional) The protocol TensorFlow uses to communicate between nodes. Defaults to 'grpc'.
Returns
SlurmClusterResolver
A ClusterResolver object which can be used with distributed TensorFlow.
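To make the port offsetting and GPU allocation described above more concrete, here is a minimal Python sketch under stated assumptions: tasks on a node receive consecutive ports starting at port_base, and each task receives a contiguous block of gpus_per_task GPU indices. The function name plan_node_tasks, the host names and the port value are hypothetical, and the resolver's actual logic may differ.

```python
def plan_node_tasks(hosts, port_base, gpus_per_node, gpus_per_task, tasks_per_node):
    """Plan one address and one GPU set per task, node by node.

    Returns two parallel lists: "host:port" addresses, and comma-separated
    GPU index strings in the format accepted by CUDA_VISIBLE_DEVICES.
    """
    addresses = []
    gpu_sets = []
    for host in hosts:
        # First GPU index of each task's block: 0, gpus_per_task, 2 * gpus_per_task, ...
        gpu_offsets = range(0, gpus_per_node, gpus_per_task)
        # zip stops at whichever runs out first: task slots or GPU blocks.
        for port_offset, gpu_offset in zip(range(tasks_per_node), gpu_offsets):
            addresses.append(f"{host}:{port_base + port_offset}")
            gpu_ids = range(gpu_offset, gpu_offset + gpus_per_task)
            gpu_sets.append(",".join(str(g) for g in gpu_ids))
    return addresses, gpu_sets


addresses, gpu_sets = plan_node_tasks(
    hosts=["node1", "node2"],  # hypothetical host names
    port_base=8888,            # example value only
    gpus_per_node=4,
    gpus_per_task=2,
    tasks_per_node=2,
)
print(addresses)  # ['node1:8888', 'node1:8889', 'node2:8888', 'node2:8889']
print(gpu_sets)   # ['0,1', '2,3', '0,1', '2,3']
```

With automatic GPU assignment enabled, a process of global rank r would then set CUDA_VISIBLE_DEVICES to gpu_sets[r] in this sketch, so that each task only sees its own block of devices.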

Public properties

string environment get;

object environment_dyn get;

object PythonObject get;

string rpc_layer get; set;

Nullable<int> task_id get; set;

object task_type get; set;