Type MultiWorkerMirroredStrategy
Namespace tensorflow.distribute.experimental
Parent Strategy
Interfaces IMultiWorkerMirroredStrategy
A distribution strategy for synchronous training on multiple workers.

This strategy implements synchronous distributed training across multiple workers, each with potentially multiple GPUs. Similar to tf.distribute.MirroredStrategy, it creates copies of all variables in the model on each device across all workers. It uses CollectiveOps's implementation of multi-worker all-reduce to keep variables in sync. A collective op is a single op in the TensorFlow graph that can automatically choose an all-reduce algorithm in the TensorFlow runtime according to hardware, network topology and tensor sizes.
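For illustration, here is a minimal sketch using the Python TensorFlow API that this type corresponds to; the model architecture and shapes are placeholders, not part of the strategy itself:

```python
import tensorflow as tf

# With no TF_CONFIG set, this falls back to single-worker training
# on all local GPUs (or the CPU).
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Variables created under the scope are mirrored on each device across
# all workers and kept in sync via collective all-reduce.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")
```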
By default, the strategy uses all local GPUs, or the CPU, for single-worker training. When the 'TF_CONFIG' environment variable is set, it parses cluster_spec, task_type and task_id from 'TF_CONFIG' and turns into a multi-worker strategy that mirrors the model on the GPUs of all machines in the cluster. In the current implementation, it uses all GPUs in the cluster and assumes all workers have the same number of GPUs.
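As a sketch, a two-worker cluster could be described to each worker as follows; the host names, ports and the 0-based task index are placeholders, and each worker sets its own index:

```python
import json
import os

# TF_CONFIG must be set before the strategy is constructed, since the
# constructor reads cluster_spec, task_type and task_id from it.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host1:12345", "host2:23456"],  # placeholder addresses
    },
    "task": {"type": "worker", "index": 0},  # use index 1 on the second worker
})

import tensorflow as tf
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
```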
The strategy supports both eager mode and graph mode. In eager mode, however, it has to set up the eager context in its constructor, so all eager-mode ops must run after the strategy object is created.
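In eager mode, the construction order therefore matters; a minimal sketch:

```python
import tensorflow as tf

# Construct the strategy first: in eager mode its constructor sets up
# the eager context, so no other eager ops should run before this line.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Eager ops are safe to run only after the strategy exists.
x = tf.constant([[1.0, 2.0]])
print(strategy.num_replicas_in_sync)
```

Here num_replicas_in_sync reports how many replicas participate in each gradient all-reduce.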