
Export debug information to StableHLO #7014

Closed
thong3le opened this issue May 1, 2024 · 11 comments · Fixed by #7046
Labels: stablehlo StableHLO related work


thong3le commented May 1, 2024

❓ Questions and Help

Hi team, the debug information is lost during exported_program_to_stablehlo. Is there a way to export this information?

For example, torch.export records the file and line number for each op:

import torch
import torch.nn as nn
from torch_xla.stablehlo import exported_program_to_stablehlo

class Test(nn.Module):
    def forward(self, a, b):
        a += 1
        b += 2
        return a + b

ep = torch.export.export(Test(), (torch.randn(1, 5), torch.randn(1, 5)))
print(ep)
# ExportedProgram:
#     class GraphModule(torch.nn.Module):
#         def forward(self, arg0_1: "f32[1, 5]", arg1_1: "f32[1, 5]"):
#             # File: /home/thonle/ai/data/stablehlo/add/add.py:7 in forward, code: a += 1
#             add: "f32[1, 5]" = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
            
#             # File: /home/thonle/ai/data/stablehlo/add/add.py:8 in forward, code: b += 2
#             add_1: "f32[1, 5]" = torch.ops.aten.add.Tensor(arg1_1, 2);  arg1_1 = None
            
#             # File: /home/thonle/ai/data/stablehlo/add/add.py:9 in forward, code: return a + b
#             add_2: "f32[1, 5]" = torch.ops.aten.add.Tensor(add, add_1)
#             return (add, add_1, add_2)

However, when we export to StableHLO, this information cannot be found in the StableHLOModelBundle:

om = exported_program_to_stablehlo(ep)
print(om._bundle)

# StableHLOModelBundle(state_dict={}, additional_constants=[array(2., dtype=float32)], stablehlo_funcs=[StableHLOFunc(meta=StableHLOFunctionMeta(name='forward', stablehlo_version='0.0.0', input_signature=[VariableSignature(shape=[1, 5], dtype='float32', dynamic_dims=[]), VariableSignature(shape=[], dtype='float32', dynamic_dims=[]), VariableSignature(shape=[1, 5], dtype='float32', dynamic_dims=[])], output_signature=[VariableSignature(shape=[1, 5], dtype='float32', dynamic_dims=[]), VariableSignature(shape=[1, 5], dtype='float32', dynamic_dims=[]), VariableSignature(shape=[1, 5], dtype='float32', dynamic_dims=[])], input_locations=[InputLocation(type_=<VariableType.INPUT_ARG: 'input_arg'>, position=0, name=''), InputLocation(type_=<VariableType.CONSTANT: 'constant'>, position=0, name=''), InputLocation(type_=<VariableType.INPUT_ARG: 'input_arg'>, position=1, name='')], unused_inputs=[], input_pytree_spec='[1, {"type": "builtins.tuple", "context": "null", "children_spec": [{"type": "builtins.tuple", "context": "null", "children_spec": [{"type": null, "context": null, "children_spec": []}, {"type": null, "context": null, "children_spec": []}]}, {"type": "builtins.dict", "context": "[]", "children_spec": []}]}]', output_pytree_spec='[1, {"type": null, "context": null, "children_spec": []}]'), bytecode=b"ML\xefR\rStableHLO_v0.19.1\x00\x01\x1d\x05\x01\x05\r\x01\x03\x0b\x03\x0b\x0f\x13\x17\x1b\x1f\x03S1\x0f\x01%\x07\x0f#\x0b\x0b\x0b\x0b\x0b\x0f\x0b\x0f\x0b\x0f\x0b\x0f\x0b\x0f\x0b\x03\r\x0b\x0b\x0b\x0b\x1f\x0f\x01\x03\x0b\x03\r\x17\x07\x0f'\x13\x07\x02\xb5\x1f\x11\x01\x00\x03\x07\x07\t\x0b\x03\r\x03\x05\x11\x01\x01\x05\x13\x05\x15\x05\x17\x1d\x13\x01\x05\x19\x1d\x17\x01\x05\x1b\x1d\x1b\x01\x05\x1d\x1d\x1f\x01\x05\x1f\x1d#\x01\x05!\x03\x01#\t\x1d#\x1d%\x1f\x03\t\x00\x00\x80?\x1f\x0b\x01\x01\t)\x05\x05\x15\x05\t)\x01\x05\x11\x07\x03\x07\x03\x07\x03\x03\x03)\x03\x01\r\x1d\x04\x91\x05\x01Q\x01\x05\x01\x07\x04\x7f\x03\x01\x05\x05P\x01\x03\x07\x04k\x03\x11\x1b\x07\x05\r\x05\x00\x07B\x11\x05\x03\x03\x03\x06\x15\x03\x03\x05\x01\x07\tF\x19\x07\x03\x03\x03\x03\x03\x06\x1d\x03\x03\x05\x05\x0b\x03\x06!\x03\x03\x05\t\r\x0b\x04\x01\x07\t\r\x0f\x06\x03\x01\x05\x01\x00\xb6\x03'\x03\x0b\x0f\x0f\x1b\r\x19\x17A!=\x15)\x19\x11\x0f\x0f\x0b\x11builtin\x00vhlo\x00module\x00add_v1\x00func_v1\x00constant_v1\x00broadcast_in_dim_v1\x00return_v1\x00mhlo.cross_program_prefetches\x00mhlo.is_dynamic\x00mhlo.use_auto_spmd_partitioning\x00IrToHlo.18\x00broadcast.5\x00add.6\x00broadcast.11\x00add.12\x00add.16\x00main\x00\x00\x08\x1d\t\x05\x1f\x01\x0b%'%)+\x03-\x03/", text='module @IrToHlo.18 attributes {mhlo.cross_program_prefetches = [], mhlo.is_dynamic = false, mhlo.use_auto_spmd_partitioning = false} {\n  func.func @main(%arg0: tensor<1x5xf32>, %arg1: tensor<f32>, %arg2: tensor<1x5xf32>) -> (tensor<1x5xf32>, tensor<1x5xf32>, tensor<1x5xf32>) {\n    %0 = stablehlo.constant dense<1.000000e+00> : tensor<1x5xf32>\n    %1 = stablehlo.add %arg0, %0 : tensor<1x5xf32>\n    %2 = stablehlo.broadcast_in_dim %arg1, dims = [] : (tensor<f32>) -> tensor<1x5xf32>\n    %3 = stablehlo.add %arg2, %2 : tensor<1x5xf32>\n    %4 = stablehlo.add %1, %3 : tensor<1x5xf32>\n    return %1, %3, %4 : tensor<1x5xf32>, tensor<1x5xf32>, tensor<1x5xf32>\n  }\n}\n')])

thong3le commented May 1, 2024

cc @JackCaoG


JackCaoG commented May 1, 2024

This will be hard. The way we consume the FX graph is to actually run it and lower the ops; during this process all of the comments are ignored.

@lsy323 @qihqi in case you guys have other ideas.


thong3le commented May 1, 2024

@JackCaoG I see. The stack trace can also be found in the metadata of the FX node, e.g. node.meta:

ipdb> nodes = list(ep.graph.nodes)
ipdb> nodes[2].meta
{'stack_trace': '  File "/home/thonle/ai/data/stablehlo/add/add.py", line 7, in forward\n    a += 1\n', 'nn_module_stack': {'L__self__': ('', <class '__main__.Test'>)}, 'source_fn_stack': [('iadd', <built-in function iadd>)], 'original_aten': <OpOverload(op='aten.add', overload='Tensor')>, 'from_node': [('a', <built-in function iadd>)], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 5)), 'tensor_meta': TensorMetadata(shape=torch.Size([1, 5]), dtype=torch.float32, requires_grad=False, stride=(5, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})}
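
As an aside, a minimal sketch that walks the exported graph and prints the recorded source line for each call node (reusing the ep from above; nodes without a stack_trace entry are skipped):

# Sketch: dump the source location recorded for each op in the graph.
for node in ep.graph.nodes:
    if node.op == "call_function":
        trace = node.meta.get("stack_trace")
        if trace:
            # Print the op name and the first line of its recorded stack trace.
            print(node.name, "->", trace.strip().splitlines()[0])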

Two follow-up questions:

  1. Is there a way to export some FX node metadata into attributes of the StableHLO ops?
  2. What is the best way to debug if the StableHLO function produces incorrect output?


JackCaoG commented May 2, 2024

For 1 I am not sure; @lsy323 and @qihqi might know better.
For 2, you can do a binary search, I guess: reduce the length of the model and figure out which PyTorch op/layer gave you the incorrect answer.
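
A minimal sketch of that idea (a linear scan rather than a true binary search), assuming the model can be expressed as an nn.Sequential and that run_reference is a hypothetical callback that runs the same prefix of layers through the StableHLO path:

import torch
import torch.nn as nn

# Sketch: find the first layer whose eager output disagrees with a reference
# backend. `run_reference` is a hypothetical callback, e.g. one that exports
# layers[:i+1] to StableHLO and executes it on the same input.
def find_first_divergence(layers: nn.Sequential, x, run_reference, atol=1e-5):
    out = x
    for i, layer in enumerate(layers):
        out = layer(out)                         # eager result up to layer i
        ref = run_reference(layers[: i + 1], x)  # same prefix via the other path
        if not torch.allclose(out, ref, atol=atol):
            return i  # index of the first diverging layer
    return None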


thong3le commented May 2, 2024

Thanks @JackCaoG, are you aware of an existing tool for (2)?


JackCaoG commented May 2, 2024

There is #5461, but I have never used it myself.


lsy323 commented May 2, 2024

@thong3le If you turn on the env var XLA_HLO_DEBUG=1, you can get some debug info in the exported StableHLO today, but it is different from and less useful than the nn_module_stack in the FX node. The nn_module_stack in the FX node cannot be propagated through the StableHLO export right now, but it is possible to add.
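
A minimal sketch of that flag in use (setting the variable before torch_xla is imported, to be safe; Test is the same toy module as in the first comment):

import os
os.environ["XLA_HLO_DEBUG"] = "1"  # enable extra debug info in the lowering

import torch
import torch.nn as nn
from torch_xla.stablehlo import exported_program_to_stablehlo

class Test(nn.Module):  # same toy module as in the first comment
    def forward(self, a, b):
        a += 1
        b += 2
        return a + b

ep = torch.export.export(Test(), (torch.randn(1, 5), torch.randn(1, 5)))
om = exported_program_to_stablehlo(ep)
# Inspect the textual StableHLO, which should now carry the extra debug info.
print(om._bundle.stablehlo_funcs[0].text)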


thong3le commented May 3, 2024

@lsy323 Thanks, is there any plan to propagate nn_module_stack to StableHLO?

tlsdmstn56 commented

I also have a similar feature request and wonder if there is a plan to propagate any metadata in fx.Node.meta as op attributes.


GleasonK commented May 9, 2024

I'm curious: which bits of the metadata are important? File/line/column info? arg0_1 = None? Everything?

@lsy323 lsy323 self-assigned this May 10, 2024
@lsy323 lsy323 added the stablehlo StableHLO related work label May 10, 2024

lsy323 commented May 10, 2024

Support was added in #7046.
