
Make torch xla available on GPU #29334

Merged: 7 commits merged into huggingface:main on Mar 11, 2024

Conversation

@yitongh (Contributor) commented Feb 28, 2024

What does this PR do?

Make torch xla available on GPU. torch xla can already run in a GPU environment, but when it is installed there are conflicts between XLA and native PyTorch CUDA. This PR introduces the environment variable USE_TORCH_XLA to address this: when USE_TORCH_XLA is set to false, native PyTorch CUDA can be used seamlessly even if torch xla is installed.
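
For example, a job that should run on native PyTorch CUDA in an environment where torch_xla is installed would flip the switch before importing transformers. A minimal usage sketch (the exact set of values parsed as "false" is an assumption):

```python
import os

# Disable torch_xla before importing transformers, so that availability
# checks fall back to native PyTorch CUDA even though torch_xla is installed.
os.environ["USE_TORCH_XLA"] = "0"

import transformers  # noqa: E402  (imported after setting the switch on purpose)
```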

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@muellerzr and @pacman100

@yitongh (Contributor, Author) commented Feb 28, 2024

The main changes:

  1. Change is_torch_tpu_available to is_torch_xla_available
  2. Change require_torch_tpu to require_torch_xla
  3. Add USE_TORCH_XLA to enable or disable torch_xla (see the sketch after this list)
  4. Fix amp check
  5. Move grad_norm.item() into _maybe_log_save_evaluate to prevent a performance degradation in XLA
  6. Copy the xla_fsdp_config to avoid modifying the original config
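
A minimal sketch of how changes 1 and 3 fit together (illustrative only; the env-var parsing and any extra parameters on the merged function may differ):

```python
import importlib.util
import os
from functools import lru_cache

# Read the switch once at import time; torch_xla stays enabled by default.
_USE_TORCH_XLA = os.environ.get("USE_TORCH_XLA", "1").upper() in {"1", "ON", "YES", "TRUE"}

@lru_cache()
def is_torch_xla_available():
    # Unlike the old is_torch_tpu_available(check_device=True), this does not
    # try to construct an XLA device; it only checks that torch_xla is
    # importable and that USE_TORCH_XLA has not disabled it.
    return _USE_TORCH_XLA and importlib.util.find_spec("torch_xla") is not None
```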

@yitongh (Contributor, Author) commented Feb 28, 2024

This PR is related to huggingface/accelerate#2176 and huggingface/accelerate#2467.

@will-cromar Could you please check if this PR has any impact on the TPU environment? Thanks.

@yitongh (Contributor, Author) commented Feb 28, 2024

The CI only failed on tests/test_modeling_utils.py::ModelUtilsTest::test_use_safetensors. I ran this test on master and it also hangs there. From pystack, the issue looks related to an SSL read. Stack log: https://gist.github.com/yitongh/34dc9c9f3de79d208533964bd63bb6f5

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yitongh (Contributor, Author) commented Mar 5, 2024

@muellerzr, would you be available to take a look at this PR when you have a moment? If not, perhaps you could suggest someone else who might be suited to review it? Many thanks.

@ArthurZucker ArthurZucker requested a review from muellerzr March 7, 2024 11:34
@muellerzr (Contributor) left a comment

Overall this is fine by me; we have very similar logic in Accelerate, if I'm not mistaken. Thanks!

@muellerzr (Contributor)

Let's make sure we can fix those failing tests though; can you try rebasing from main?

@yitongh (Contributor, Author) commented Mar 8, 2024

@muellerzr I have rebased from main. I reran the failing tests on my machine: both main and this PR pass test_run_ner_no_trainer and test_run_squad_no_trainer but fail test_run_glue_no_trainer, so the failure doesn't look related to this PR.

pytest -s -v examples/pytorch/test_accelerate_examples.py::ExamplesTestsNoTrainer::test_run_ner_no_trainer examples/pytorch/test_accelerate_examples.py::ExamplesTestsNoTrainer::test_run_squad_no_trainer examples/pytorch/test_accelerate_examples.py::ExamplesTestsNoTrainer::test_run_glue_no_trainer
======================================================================================================== test session starts =========================================================================================================
platform linux -- Python 3.10.12, pytest-8.0.0, pluggy-1.4.0 -- /usr/bin/python3.10
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/root/hyt/github/transformers/.hypothesis/examples'))
rootdir: /root/hyt/github/transformers
configfile: pyproject.toml
plugins: hypothesis-6.98.17, xdist-3.5.0, subtests-0.12.1, anyio-4.3.0, timeout-2.3.1
collected 3 items

examples/pytorch/test_accelerate_examples.py::ExamplesTestsNoTrainer::test_run_ner_no_trainer PASSED
examples/pytorch/test_accelerate_examples.py::ExamplesTestsNoTrainer::test_run_squad_no_trainer PASSED
examples/pytorch/test_accelerate_examples.py::ExamplesTestsNoTrainer::test_run_glue_no_trainer FAILED

============================================================================================================== FAILURES ==============================================================================================================
__________________________________________________________________________________________ ExamplesTestsNoTrainer.test_run_glue_no_trainer ___________________________________________________________________________________________

self = <test_accelerate_examples.ExamplesTestsNoTrainer testMethod=test_run_glue_no_trainer>

    @mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"})
    def test_run_glue_no_trainer(self):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            {self.examples_dir}/pytorch/text-classification/run_glue_no_trainer.py
            --model_name_or_path distilbert/distilbert-base-uncased
            --output_dir {tmp_dir}
            --train_file ./tests/fixtures/tests_samples/MRPC/train.csv
            --validation_file ./tests/fixtures/tests_samples/MRPC/dev.csv
            --per_device_train_batch_size=2
            --per_device_eval_batch_size=1
            --learning_rate=1e-4
            --seed=42
            --num_warmup_steps=2
            --checkpointing_steps epoch
            --with_tracking
        """.split()

        run_command(self._launch_args + testargs)
        result = get_results(tmp_dir)
>       self.assertGreaterEqual(result["eval_accuracy"], 0.75)
E       AssertionError: 0.6666666666666666 not greater than or equal to 0.75

examples/pytorch/test_accelerate_examples.py:98: AssertionError
========================================================================================================== warnings summary ==========================================================================================================
../../../../usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py:1394
  /usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py:1394: PytestConfigWarning: Unknown config option: doctest_glob

    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================================================== short test summary info =======================================================================================================
FAILED examples/pytorch/test_accelerate_examples.py::ExamplesTestsNoTrainer::test_run_glue_no_trainer - AssertionError: 0.6666666666666666 not greater than or equal to 0.75
========================================================================================= 1 failed, 2 passed, 1 warning in 143.07s (0:02:23) =========================================================================================

@muellerzr (Contributor)

They look to be timeout issues; I'm rerunning the tests now. However, if they still fail, note that they were not failing before this 😅

@muellerzr (Contributor)

Tests pass on our CI, so this looks to be fine.

@muellerzr (Contributor) left a comment

cc @amyeroberts for final review

@muellerzr muellerzr requested a review from amyeroberts March 8, 2024 16:00
@amyeroberts (Collaborator) left a comment

Thanks for working on this!

Mostly just small comments about the deprecation handling.

Main concern is that previously check_device was True by default. Therefore, replacing is_torch_tpu_available() with is_torch_xla_available() isn't an equivalent call.
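
For context, the pre-PR helper looked roughly like this (a sketch reconstructed from the diff excerpts quoted below, not verbatim source):

```python
import importlib.util
from functools import lru_cache

@lru_cache()
def is_torch_tpu_available(check_device=True):
    "Checks if `torch_xla` is installed and potentially if a TPU is in the environment"
    if importlib.util.find_spec("torch_xla") is None:
        return False
    if check_device:
        # With the old default, "available" meant an XLA device could actually
        # be created, not merely that the torch_xla package was installed.
        try:
            import torch_xla.core.xla_model as xm

            _ = xm.xla_device()
        except RuntimeError:
            return False
    return True
```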

@@ -497,13 +513,33 @@ def is_torch_tpu_available(check_device=True):
             except RuntimeError:
                 return False
         return True
@amyeroberts (Collaborator):

The final return should remain

Suggested change
-return True
+return True
+return False

@yitongh (Contributor, Author):

Done

@@ -188,7 +188,7 @@
     is_torch_sdpa_available,
     is_torch_tensorrt_fx_available,
     is_torch_tf32_available,
-    is_torch_tpu_available,
+    is_torch_xla_available,
@amyeroberts (Collaborator):

We need to leave this as importable whilst it's still going through the deprecation cycle

Suggested change
-is_torch_xla_available,
+is_torch_tpu_available,
+is_torch_xla_available,

@yitongh (Contributor, Author):

Done

@@ -135,7 +135,7 @@ def _get_lr_scheduler(self, num_training_steps):
     def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]:
         if isinstance(self.train_dataset, torch.utils.data.IterableDataset):
             return None
-        elif is_torch_tpu_available():
+        elif is_torch_xla_available():
@amyeroberts (Collaborator):

This isn't equivalent: previously we were checking for a device, but by default that isn't happening anymore.

@yitongh (Contributor, Author):

This is to align with the corresponding PR in the accelerate library. If users do not wish to use torch_xla in an environment where it is installed, they can disable it via USE_TORCH_XLA; that is exactly the purpose of this PR.

@@ -2404,7 +2404,7 @@ def _maybe_log_save_evaluate(self, tr_loss, grad_norm, model, trial, epoch, igno

             logs["loss"] = round(tr_loss_scalar / (self.state.global_step - self._globalstep_last_logged), 4)
             if grad_norm is not None:
-                logs["grad_norm"] = grad_norm
+                logs["grad_norm"] = grad_norm.item() if torch.is_tensor(grad_norm) else grad_norm
@amyeroberts (Collaborator):

This change doesn't seem to have anything to do with the goal of this PR.

@yitongh (Contributor, Author):

This modification is needed because evaluating the tensor (grad_norm.item()) forces XLA to execute the entire computation graph prematurely, degrading performance. The grad_norm.item() call should happen after the XLA mark_step (see the sketch below).
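
A minimal sketch of that lazy-execution behavior (assuming a working torch_xla install; xm.mark_step is torch_xla's step-boundary API):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
w = torch.randn(4, 4, device=device, requires_grad=True)
loss = (w * w).sum()
loss.backward()

# clip_grad_norm_ returns the total norm; on an XLA device it is a lazy tensor.
grad_norm = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)

# Calling grad_norm.item() at this point would force compilation and
# execution of the partial graph built so far, splitting the training
# step into two XLA programs and hurting performance.
xm.mark_step()           # cut the graph at the intended step boundary first
print(grad_norm.item())  # the host sync now happens at a natural barrier
```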

@@ -2016,7 +2016,7 @@ def _inner_training_loop(
                         if hasattr(grad_norm, "item"):
                             grad_norm = grad_norm.item()
                     else:
-                        grad_norm = _grad_norm.item() if _grad_norm is not None else None
+                        grad_norm = _grad_norm
@amyeroberts (Collaborator):

same here

@@ -1090,8 +1090,8 @@
     "is_torch_available",
     "is_torch_neuroncore_available",
     "is_torch_npu_available",
-    "is_torch_tpu_available",
     "is_torchvision_available",
@amyeroberts (Collaborator):

We need to keep this whilst it's still being deprecated.

Suggested change
-"is_torchvision_available",
+"is_torch_tpu_available",
+"is_torchvision_available",

@yitongh (Contributor, Author):

Done

@@ -5894,7 +5894,7 @@
     is_torch_available,
     is_torch_neuroncore_available,
     is_torch_npu_available,
-    is_torch_tpu_available,
+    is_torch_xla_available,
@amyeroberts (Collaborator):

Suggested change
-is_torch_xla_available,
+is_torch_tpu_available,
+is_torch_xla_available,

@yitongh (Contributor, Author):

Done

@@ -484,6 +494,12 @@ def is_g2p_en_available():
 @lru_cache()
 def is_torch_tpu_available(check_device=True):
     "Checks if `torch_xla` is installed and potentially if a TPU is in the environment"
+    warnings.warn(
+        "`is_torch_tpu_available` is deprecated and will be removed in 4.39.0. "
@amyeroberts (Collaborator):

4.39.0 will be the next release, so the function would need to be removed right away! As it's a public object, it should go through at least two release cycles.

Suggested change
-"`is_torch_tpu_available` is deprecated and will be removed in 4.39.0. "
+"`is_torch_tpu_available` is deprecated and will be removed in 4.41.0. "

@yitongh (Contributor, Author):

Done
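
For reference, a minimal sketch of the deprecation-shim pattern settled on here (illustrative; the real helper also preserves the old check_device behavior rather than simply delegating):

```python
import importlib.util
import warnings
from functools import lru_cache

def is_torch_xla_available():
    # Simplified stand-in for the new check sketched earlier in the thread.
    return importlib.util.find_spec("torch_xla") is not None

@lru_cache()
def is_torch_tpu_available(check_device=True):
    "Checks if `torch_xla` is installed and potentially if a TPU is in the environment"
    # Warn but keep working for at least two release cycles, per the review.
    warnings.warn(
        "`is_torch_tpu_available` is deprecated and will be removed in 4.41.0. "
        "Please use `is_torch_xla_available` instead.",
        FutureWarning,
    )
    return is_torch_xla_available()
```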

@amyeroberts (Collaborator) left a comment

Changes look good to me - thanks for iterating and explaining the design choices!

@amyeroberts amyeroberts merged commit 873d9bb into huggingface:main Mar 11, 2024
20 checks passed
itazap pushed a commit that referenced this pull request May 14, 2024
* add USE_TORCH_XLA env

* rename torch_tpu to torch_xla

* better is_torch_xla_available; fix some fsdp and performance issues

* fix format

* fix bug when pjrt_device is cpu

* fix bug

* fix the deprecation handling

---------

Co-authored-by: anw90 <[email protected]>
Co-authored-by: wangang.wa <[email protected]>