Changes to support TPUs #1

Open · wants to merge 2 commits into master
Conversation

ultrons (Owner) commented Nov 3, 2020

For internal review.

Files with review threads:
mmf/trainers/core/training_loop.py
mmf/trainers/mmf_trainer.py (outdated)
mmf/utils/build.py (outdated)
mmf/utils/build.py (outdated)
mmf_cli/run.py
initial changes to support training on tpus

changed tpu configuration to use training.device

replaced parallelLoader with mpLoader to solve the loader-exhaustion issue

removed debug message. updated the comment

added comments for drop_last change.

removed pdb lines

removed redundant device config

added comments for pending changes

default init not applicable for xla device type

moved wrapping of dataloader to build

added line-debug function metsumm

removed some .item calls from reporting

added xla equivalents in the distributed module; earlier, eval was failing at the metrics all-reduce step

implemented broadcast in terms of all_to_all (see the sketch after this commit list)

changes for checkpoint saving

change to make execution even across cores

corrected the is_master logic

one more fix for is_master

clean up of debug messages
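
The "implemented broadcast in terms of all_to_all" commit is not shown in the hunks below. As a rough illustration of building broadcast out of the collectives torch_xla does expose, here is a minimal sketch that emulates broadcast with a sum all_reduce (zeroing the value on non-source ordinals); the helper name xla_broadcast and the src argument are hypothetical, and the PR's actual all_to_all-based version may differ.

```python
import torch
import torch_xla.core.xla_model as xm


def xla_broadcast(tensor: torch.Tensor, src: int = 0) -> torch.Tensor:
    """Hypothetical sketch: emulate broadcast with a sum all_reduce.

    Every ordinal except `src` zeroes its copy, so the summed result equals
    the source tensor on all cores.
    """
    if xm.get_ordinal() != src:
        tensor = torch.zeros_like(tensor)
    return xm.all_reduce(xm.REDUCE_SUM, tensor)
```
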
```diff
@@ -60,8 +60,8 @@ def update(self, update_dict, batch_size):
             if isinstance(v, torch.Tensor):
                 if v.dim() != 0:
                     v = v.mean()
-                v = v.item()
+                #v = v.item()
             assert isinstance(v, (float, int))
```


Instead of commenting out here, let's have a util function like

```python
def item(self, v):
    if torch.is_tensor(v) and v.device.type == 'xla':
        return v
    return v.item()
```

and use `v = self.item(v)`, and then assert on `isinstance(v, (float, int)) or v.device.type == 'xla'`.
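
For concreteness, a minimal sketch of how `update` could look with that helper applied (the helper's placement on the meter class and the exact assert wording are one reading of the suggestion, not code from this PR):

```python
import torch


class Meter:
    # Sketch only; the rest of the class is omitted.
    def item(self, v):
        # Keep XLA tensors lazy: calling .item() would force a device-to-host sync.
        if torch.is_tensor(v) and v.device.type == "xla":
            return v
        return v.item()

    def update(self, update_dict, batch_size):
        for k, v in update_dict.items():
            if isinstance(v, torch.Tensor):
                if v.dim() != 0:
                    v = v.mean()
                v = self.item(v)
            assert isinstance(v, (float, int)) or (
                torch.is_tensor(v) and v.device.type == "xla"
            )
            # ... existing meter bookkeeping continues here.
```
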

```python
    # Since other device types such as xla can be passed,
    # falling back to cpu should only happen when device_type
    # is set to cuda but cuda is not available.
    if not torch.cuda.is_available() and device == "cuda":
```


Reordering as `device == "cuda" and not torch.cuda.is_available()` will save you the cuda-availability check whenever the device isn't cuda.
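
A minimal sketch of the reordered check (the surrounding warning/fallback body is illustrative, not the exact code in build.py):

```python
import warnings

import torch

device = config.training.device  # from the config, per the training.device commit above
# The cheap string compare short-circuits first, so CUDA availability is only
# queried when the user actually asked for cuda.
if device == "cuda" and not torch.cuda.is_available():
    warnings.warn("training.device is 'cuda' but CUDA is not available; falling back to cpu")
    device = "cpu"
```
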

```diff
@@ -186,9 +186,15 @@ def _infer_dataset_probabilities(self):
     def __len__(self):
         # Since, this is iterator, we need to return total length == number of batches
         batch_size = get_batch_size()
+        # Changed the length to accomadate drop_last == True
+        # drop_last is required if the batch is split intor multiple cores
```


s/intor/into/

```diff
+        # Changed the length to accomadate drop_last == True
+        # drop_last is required if the batch is split intor multiple cores
+        # some of the cores may not have enough examples.
+        if is_xla():
```


Can you use the bool drop_last here instead of is_xla()?
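
A sketch of what `__len__` might look like with an explicit drop_last flag (the `self.drop_last` and `self._total_length` attribute names are hypothetical):

```python
import math


def __len__(self):
    # Returns the number of batches, not the number of samples.
    batch_size = get_batch_size()  # same helper as in the hunk above
    total = self._total_length     # hypothetical: total number of samples
    if self.drop_last:
        # Trailing incomplete batches are dropped so that every core sees the
        # same number of full batches.
        return total // batch_size
    return math.ceil(total / batch_size)
```
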

```python
        self.device = xm.xla_device()
        self.distributed = True
        self.local_rank = xm.get_local_ordinal()
        self.tpu = True
```


I think using self.xla to denote xla usage is better than using self.tpu
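
For reference, a sketch of the same setup with that naming suggestion applied (the class and method names here are placeholders, not this PR's code):

```python
import torch_xla.core.xla_model as xm


class DeviceSetupSketch:
    def configure_xla_device(self):
        self.device = xm.xla_device()             # this process's XLA device (one TPU core)
        self.distributed = True
        self.local_rank = xm.get_local_ordinal()  # ordinal of the core on this host
        self.xla = True                           # "xla" rather than "tpu": XLA is not TPU-specific
```
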

```diff
@@ -46,7 +46,7 @@ def __call__(self, update, iteration, meter):
         Returns:
             bool -- Tells whether early stopping occurred or not
         """
-        if not is_master():
+        if not is_master() and not is_xla():
```


I don't understand this. Why always False if not the master ordinal?
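
One possible motivation, not stated in the thread: on XLA every core has to execute the same graph and the same collectives, so the early-stop decision cannot be taken on the master ordinal alone. A hypothetical way to keep the cores consistent is to compute the decision on master and mirror it with a collective, e.g.:

```python
import torch
import torch_xla.core.xla_model as xm


def sync_early_stop_decision(should_stop: bool) -> bool:
    # Hypothetical sketch: mirror the master's decision to every core with a
    # max all_reduce so that all ordinals take the same branch afterwards.
    flag = torch.tensor(
        float(should_stop) if xm.is_master_ordinal() else 0.0,
        device=xm.xla_device(),
    )
    flag = xm.all_reduce(xm.REDUCE_MAX, flag)
    return bool(flag.item())
```
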

```diff
@@ -32,6 +32,7 @@ def main(configuration, init_distributed=False, predict=False):
     if init_distributed:
         distributed_init(config)
+
```


blank line, remove.

```diff
@@ -96,6 +97,7 @@ def run(opts: typing.Optional[typing.List[str]] = None, predict: bool = False):
     if config.distributed.init_method is None:
         infer_init_method(config)
+
```


remove blank line

```diff
     )
+    if is_xla():
+        import torch_xla.distributed.xla_multiprocessing as xmp
+        torch.multiprocessing.set_sharing_strategy("file_system")
```


why this?
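
For context on what the import is presumably setting up (a sketch of a typical torch_xla launch path, not the exact code in this PR): xmp.spawn forks one process per TPU core, and set_sharing_strategy("file_system") switches torch.multiprocessing to file-backed tensor sharing, which avoids exhausting file descriptors when many dataloader/worker processes exchange tensors. The entry-point name and nprocs value below are assumptions.

```python
import torch
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_entry(index, config):
    # Hypothetical per-core entry point; `index` is the ordinal assigned by xmp.
    main(config, init_distributed=True)  # `main` as defined in mmf_cli/run.py


if is_xla():  # `is_xla` as used in the hunk above
    # file_system sharing avoids "too many open files" when worker processes
    # pass tensors through torch.multiprocessing.
    torch.multiprocessing.set_sharing_strategy("file_system")
    xmp.spawn(_mp_entry, args=(config,), nprocs=8, start_method="fork")  # e.g. 8 cores on a v3-8
```
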
