
Tensor mismatch on embedding op using bfloat16 weights. #1404

Closed · dgolubovicTT opened this issue Nov 25, 2024 · 2 comments · Fixed by #1633

@dgolubovicTT (Contributor)
Running the single-op embedding test with bfloat16 weights causes a tensor mismatch. However, this can't be reproduced with the ttnn embedding test.
Here is the ttnn IR of the case causing the tensor mismatch: test_embedding_bfloat16_data_mismatch_ttnn.txt
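
For context, the failing forge-side single-op test is along these lines (a minimal sketch, not the actual test; the forge.compile/verify usage and the shapes are assumptions mirrored from the ttnn repro below):

import pytest
import torch

import forge
from forge.verify.verify import verify  # assumed tt-forge-fe verify helper


@pytest.mark.parametrize("vocab_size, embedding_dim", [(32000, 3200)])
def test_embedding_bfloat16(vocab_size, embedding_dim):
    class Embedding(torch.nn.Module):
        def __init__(self):
            super().__init__()
            # bfloat16 weights are the trigger: float32 weights are not
            # supported by the ttnn embedding op.
            self.embedding = torch.nn.Embedding(vocab_size, embedding_dim).to(torch.bfloat16)

        def forward(self, indices):
            return self.embedding(indices)

    indices = torch.randint(0, vocab_size, (1, 12))
    framework_model = Embedding()
    compiled_model = forge.compile(framework_model, sample_inputs=[indices])

    # Compares device output against the PyTorch golden; this is where the
    # tensor mismatch is reported.
    verify([indices], framework_model, compiled_model)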

Here is the ttnn repro test that passes:

import pytest
import torch

import ttnn

# Helper imports as used in tt-metal's ttnn test suite (module paths may vary by revision):
from models.utility_functions import torch_random
from tests.ttnn.utils_for_testing import assert_with_pcc


@pytest.mark.parametrize("batch_size", [1])
@pytest.mark.parametrize("sentence_size", [12])
@pytest.mark.parametrize("hidden_embedding_dim", [3200])  # Bert_Num_Cols_768, Llama_Num_Cols
@pytest.mark.parametrize(
    "vocabulary_size", [32000]
)  # Bert_Position_Embeddings_512, Bert_Word_Embeddings_30528, Llama_Position_Embeddings,
@pytest.mark.parametrize("dtype", [ttnn.bfloat16])
@pytest.mark.parametrize("input_mem_config", [ttnn.DRAM_MEMORY_CONFIG])
@pytest.mark.parametrize("output_mem_config", [ttnn.DRAM_MEMORY_CONFIG])
@pytest.mark.parametrize("layout", [ttnn.ROW_MAJOR_LAYOUT])
def test_embedding(
    device,
    batch_size,
    sentence_size,
    hidden_embedding_dim,
    vocabulary_size,
    dtype,
    input_mem_config,
    output_mem_config,
    layout,
):
    torch.manual_seed(1234)

    torch_input_tensor = torch.randint(0, vocabulary_size - 1, (batch_size, sentence_size))
    torch_weights = torch_random((vocabulary_size, hidden_embedding_dim), -0.1, 0.1, dtype=torch.bfloat16)
    torch_output_tensor = torch.nn.functional.embedding(torch_input_tensor, torch_weights)

    input_tensor = ttnn.to_device(ttnn.from_torch(torch_input_tensor), device, memory_config=input_mem_config)
    weights = ttnn.to_device(ttnn.from_torch(torch_weights, dtype=dtype), device, memory_config=input_mem_config)

    output_tensor = ttnn.embedding(input_tensor, weights, memory_config=output_mem_config, layout=layout)
    output_tensor = ttnn.to_torch(output_tensor)

    assert_with_pcc(torch_output_tensor, output_tensor)

Comparing the ttnn test and the ttnn IR, I can't find any difference that could explain why the ttnn test passes while the ttnn IR fails.
I need help from someone on the mlir side @sdjordjevicTT.

Note: the embedding op doesn't support float32 weights, so I tried bfloat16 and ran into this.

@dgolubovicTT (Contributor, Author)

As agreed offline, I am providing the ttir and ttnn generated on latest main of forge:
LlamaEmbedding_data_mismatch_bfloat16_ttir.txt
Llama_Embedding_data_mismatch_bfloat16_ttnn.txt

dgolubovicTT added a commit to tenstorrent/tt-forge-fe that referenced this issue Dec 10, 2024
…dd mlir hacks to push the compile to the end. Now embedding hangs in ttnn runtime, which is expected from tenstorrent/tt-mlir#1404
@nsmithtt (Contributor)

We have a "new" golden flow that we are developing that I think could be used for unit testing ops like this. See: https://github.com/tenstorrent/tt-mlir/blob/main/test/python/golden/test_ttir_ops.py

Embedding op + golden func will have to be added to https://github.com/tenstorrent/tt-mlir/blob/main/python/test_infra/ttir_builder.py
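
Concretely, that might look something like this (a hypothetical sketch only; op_proxy, compile_to_flatbuffer, and the decorator signature are assumptions, so check ttir_builder.py and test_ttir_ops.py on main for the real APIs):

# In python/test_infra/ttir_builder.py, pair the TTIR embedding op with a
# torch golden function (helper names here are illustrative):
def embedding(self, in0: Operand, weight: Operand) -> OpView:
    # Golden reference is plain torch: torch.nn.functional.embedding(indices, weight)
    return self.op_proxy(torch.nn.functional.embedding, ttir.EmbeddingOp, [in0, weight])

# In test/python/golden/test_ttir_ops.py, a unit test along these lines
# (shapes taken from the repro above):
@compile_to_flatbuffer([(1, 12), (32000, 3200)], targets=["ttnn"])
def test_embedding(in0: Operand, in1: Operand, builder: TTIRBuilder):
    return builder.embedding(in0, in1)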

Sync with @ctodTT for questions.
