checkpoint/restore problem #17

Closed
YixinSong-e opened this issue Mar 12, 2023 · 1 comment

YixinSong-e commented Mar 12, 2023

When I run the test script for checkpoint/restore (c/r), the output is:

[New Thread 0x7f37c96c5000 (LWP 38397)]
[New Thread 0x7f37a12e0000 (LWP 38398)]
[Detaching after fork from child process 38399]
[Thread 0x7f37c96c5000 (LWP 38397) exited]

Thread 1 "kernel.testapp" received signal SIGURG, Urgent I/O condition.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x00007f37a2fae0a0 in kernel(unsigned short*, unsigned short*, unsigned short*, char, short, int, long long)<<<(1,1,1),(32,1,1)>>> ()
[DEBUG] gdb_init out.
Cuda api initialized and attached!
got API
Device "NVIDIA A100-SXM4-80GB":
        index: 0
        type: "GA100GL-A"
        SM type: "sm_80"
        lanes: 32
        predicates 8
        registers: 255
        SMs: 108
        warps: 64

checkpointing kernel with name: "_Z6kernelPtS_S_csix"
stack-size: 336, param-addr: 352, param-size: 40, param-num: 7
SM 0: 1 - 0000000000000000000000000000000000000000000000000000000000000001
03/12/23 10:42:43.888585   DEBUG: relative 6a0, virtual 7f37a2fae0a0    in /cricket-cr.c(670)
SM 0 warp 0 (active): 55555555 - 01010101010101010101010101010101
SM 0 warp 0 (valid): ffffffff - 11111111111111111111111111111111
03/12/23 10:42:43.888637   DEBUG: function "_Z6kernelPtS_S_csix" has no room (0 slots)  in /cricket-cr.c(831)
03/12/23 10:42:43.888647   ERROR: There is no room in the top level function (i.e. the kernel). This kernel can thus never be restored!        in /cricket-cr.c(835)
cricket-checkpoint: could not make checkpointable.

Thread 1 "kernel.testapp" received signal SIGURG, Urgent I/O condition.
0x00007ffed552baea in clock_gettime ()
[Inferior 1 (process 38378) detached]
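
For anyone reading the log above, here is a minimal C sketch of how two of its lines can be cross-checked. It is not taken from the cricket sources. `mask_to_bits`, `count_set` and `struct kernel_params` are hypothetical names made up for illustration. The struct assumes each kernel parameter sits at its natural alignment on a 64-bit (LP64) host, which may not be how the checkpointer actually computes `param-size`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the demangled signature from the log:
 * kernel(unsigned short*, unsigned short*, unsigned short*,
 *        char, short, int, long long)
 * Assumption: natural alignment, 8-byte pointers (LP64). */
struct kernel_params {
    unsigned short *a;  /* offset 0  */
    unsigned short *b;  /* offset 8  */
    unsigned short *c;  /* offset 16 */
    char d;             /* offset 24 */
    short e;            /* offset 26, after 1 byte of padding */
    int f;              /* offset 28 */
    long long g;        /* offset 32 */
};

/* Write the low `bits` bits of `mask` into `out`, most significant
 * bit first, the same order the log uses after the hex value.
 * `bits` must be between 1 and 64; `out` needs bits + 1 bytes. */
static void mask_to_bits(uint64_t mask, unsigned bits, char *out)
{
    for (unsigned i = 0; i < bits; i++) {
        out[i] = ((mask >> (bits - 1 - i)) & 1u) ? '1' : '0';
    }
    out[bits] = '\0';
}

/* Count the set bits, i.e. how many lanes (or warps) the mask marks. */
static unsigned count_set(uint64_t mask)
{
    unsigned n = 0;
    for (; mask != 0; mask &= mask - 1) {
        n++;
    }
    return n;
}
```

Under these assumptions, `mask_to_bits(0x55555555, 32, buf)` reproduces the bit string printed on the `warp 0 (active)` line, and `count_set(0x55555555)` gives 16, so every other lane of the warp is marked active. `sizeof(struct kernel_params)` comes out as 40 with 7 members, which is consistent with `param-size: 40, param-num: 7`.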

Member

n-eiling commented Jun 2, 2023

This is related to GPU kernel checkpointing, which is currently not working. We are working on repairing it (see #19).
Note that you most likely don't need this: the code in the cpu subdirectory can also do checkpoints.

n-eiling closed this as completed Jun 2, 2023