Multi-core simulation encountered 192cores bottleneck #157

Open
zhb9103 opened this issue Oct 2, 2024 · 12 comments

@zhb9103

zhb9103 commented Oct 2, 2024

Hi experts:

I checked out the openpiton_dev branch and changed the code with reference to the second-to-last Metro-MPI commit (https://github.com/metro-mpi/metro-mpi/commits/metro-mpi/, commit 264b365).

I use "sims -sys=manycore -x_tiles=16 -y_tiles=12 -msm_build -ariane" generated 192 cores(or below 192 cores xy-tiles configuration), use "sims -sys=manycore -msm_run -x_tiles=4 -y_tiles=4 hello_world_many.c -ariane -finish_mask 0x1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 -rtl_timeout 1000000000000" simulated and I can see
Hello world, this is hart 0 of 16 harts!
Hello world, this is hart 1 of 16 harts!
Hello world, this is hart 2 of 16 harts!
Hello world, this is hart 3 of 16 harts!
Hello world, this is hart 4 of 16 harts!
Hello world, this is hart 5 of 16 harts!
Hello world, this is hart 6 of 16 harts!
Hello world, this is hart 7 of 16 harts!
Hello world, this is hart 8 of 16 harts!
Hello world, this is hart 9 of 16 harts!
Hello world, this is hart 10 of 16 harts!
Hello world, this is hart 11 of 16 harts!
Hello world, this is hart 12 of 16 harts!
Hello world, this is hart 13 of 16 harts!
Hello world, this is hart 14 of 16 harts!
Hello world, this is hart 15 of 16 harts!
information in the fake_uart.log

I use "sims -sys=manycore -x_tiles=16 -y_tiles=13 -msm_build -ariane" generated 208 cores(or above 192 cores xy-tiles configuration), use "sims -sys=manycore -msm_run -x_tiles=4 -y_tiles=4 hello_world_many.c -ariane -finish_mask 0x1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 -rtl_timeout 1000000000000" simulated and waited a long time(above 12 hours), but I can't see any print in the fake_uart.log

Is there any other limitation above 192 cores?

Thanks!
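
The `-finish_mask` values in the commands above are long runs of the hex digit 1. As a rough sketch, assuming the mask carries one `1` digit per tile (an assumption, not something confirmed in this thread; check the OpenPiton simulation manual), a mask of matching length could be generated with a small hypothetical helper such as `build_finish_mask`, which is not part of OpenPiton:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of OpenPiton.
 * Writes "0x" followed by one hex digit '1' per tile into buf.
 * ASSUMPTION: the finish mask uses one '1' hex digit per tile.
 * Returns the length of the string written, or 0 if buf is too small. */
size_t build_finish_mask(char *buf, size_t buflen,
                         unsigned x_tiles, unsigned y_tiles)
{
    size_t tiles = (size_t)x_tiles * y_tiles;
    size_t needed = 2 + tiles + 1; /* "0x" + digits + terminating NUL */

    if (buf == NULL || buflen < needed)
        return 0;

    buf[0] = '0';
    buf[1] = 'x';
    memset(buf + 2, '1', tiles);
    buf[2 + tiles] = '\0';
    return 2 + tiles;
}
```

For a 16x13 build this would produce `0x` followed by 208 ones. Whether a mask that is longer or shorter than the tile count affects the runs reported here is not established in this thread.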

@zhb9103 changed the title from "Multi-core simulation encounted 192cores limitation" to "Multi-core simulation encounted 192cores bottleneck" on Oct 2, 2024
@Jbalkind
Collaborator

Jbalkind commented Oct 2, 2024

Can you instead try the hello_world_token.c test, which I think should be released with metro-mpi? There's a software bottleneck in the _many.c test itself, and the token test should help with that.

@zhb9103
Author

zhb9103 commented Oct 2, 2024

Ok, I will try it, thank you very much!


@zhb9103
Author

zhb9103 commented Oct 2, 2024

I tried hello_world_token.c instead of hello_world_many.c, but there is still no output in fake_uart.log. I also found that all of the trace_hart_*.log files are empty, which suggests there is no communication at all in the test. Is there any data width requirement above 192 cores?

After a while, I can see the following in trace_hart_5.log:
Exception @ 34127500, PC: 000000fff1010000, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 37715500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 39707500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 43025500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 46561500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 49670500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 52765500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 55861500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 58970500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 62065500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 65161500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 68267500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 71365500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 74460500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 77556500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 80665500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 83760500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 86856500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 89965500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 93060500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 96156500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 99265500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 102360500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 105455500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 108551500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 111660500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 114755500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 117851500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 120960500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 124055500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 127151500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 130260500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 133355500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 136450500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 139557500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 142655500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 145750500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 148846500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 151955500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 155050500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 158146500, PC: 000000fff1010040, Cause: Ille
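
A log like the one above is easier to reason about once the timestamp and PC are extracted from each line. The following is a minimal sketch of such a parser, assuming the `Exception @ <time>, PC: <hex>, ...` layout shown in this thread; `trace_exception` and `parse_exception_line` are hypothetical names, not part of OpenPiton or Ariane:

```c
#include <stdio.h>

/* Hypothetical record for one "Exception @ ..." trace line. */
struct trace_exception {
    unsigned long long time; /* simulation timestamp as printed */
    unsigned long long pc;   /* program counter of the faulting instruction */
};

/* Parses a line such as
 *   "Exception @ 34127500, PC: 000000fff1010000, Cause: Illegal Instruction,"
 * ASSUMPTION: the line layout matches the log excerpts in this thread.
 * Returns 1 on success and 0 if the line does not match (e.g. a "tval:" line). */
int parse_exception_line(const char *line, struct trace_exception *out)
{
    unsigned long long t, pc;

    if (line == NULL || out == NULL)
        return 0;
    if (sscanf(line, "Exception @ %llu, PC: %llx", &t, &pc) != 2)
        return 0;

    out->time = t;
    out->pc = pc;
    return 1;
}
```

Feeding every line of a trace_hart_*.log file through this function and counting distinct `pc` values would show at a glance whether a hart is repeatedly trapping at the same address, as the excerpt above suggests for `0xfff1010040`.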

@guillemlp

I am quite intrigued by the results you are getting. I have experienced similar problems in the past.
You are using the metro_mpi commit, but you are not simulating with metro_mpi, right?
Can you point me to which hello_world.c/hello_world_many.c you are using?

@zhb9103
Author

zhb9103 commented Oct 2, 2024

Hi @guillemlp

My replies are below:

You are using the metro_mpi commit, but you are not simulating with metro_mpi, right?
---> I am not really using the metro_mpi project. I only changed the *_LSID width and the relevant values in the openpiton_dev project, using metro_mpi as a reference. With that done, I can get more than 64 cores, but I now hit a bottleneck at 192 cores.

Can you point me to which hello_world.c/hello_world_many.c you are using?
---> Yes, I am using hello_world_many.c to test now. I used hello_world_token.c before, but nothing was printed in fake_uart.log.

Thanks!

@zhb9103 changed the title from "Multi-core simulation encounted 192cores bottleneck" to "Multi-core simulation encountered 192cores bottleneck" on Oct 2, 2024
@zhb9103
Author

zhb9103 commented Oct 3, 2024

I retried the test with hello_world_token.c; the behavior is the same as with hello_world_many.c.

@guillemlp

Can you verify whether the argv variable in main is char or int? (It should be int if you are using more than 64 cores.)
Have you tried running the hello world token test correctly with 128 cores?
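
To illustrate why the element type of argv matters here: hart IDs and hart counts passed through an 8-bit type stop being representable once they exceed that type's range. The sketch below uses hypothetical helper names and is not taken from syscalls.c; it only shows the standard C ranges involved. Whether plain `char` is signed or unsigned is implementation-defined, so both cases are shown.

```c
#include <limits.h>
#include <stdint.h>

/* Returns 1 if value is representable in a signed char, else 0. */
int fits_in_signed_char(int value)
{
    return value >= SCHAR_MIN && value <= SCHAR_MAX;
}

/* Returns 1 if value is representable in an unsigned 8-bit integer, else 0. */
int fits_in_uint8(int value)
{
    return value >= 0 && value <= UINT8_MAX;
}

/* Conversion to an unsigned 8-bit type is defined as reduction modulo 256,
 * so a hart count of 256 silently becomes 0. */
unsigned narrow_to_uint8(int value)
{
    return (uint8_t)value;
}
```

With these helpers, a hart count of 256 (a 16x16 configuration) does not fit in either 8-bit type, and `narrow_to_uint8(256)` yields 0, while an `int` holds all of the core counts discussed in this thread.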

@zhb9103
Author

zhb9103 commented Oct 5, 2024

Hi @guillemlp:

Can you verify whether the argv variable in main is char or int? (It should be int if you are using more than 64 cores.)
---> I have done that; the related code is below:

  1. syscalls.c
    int __attribute__((weak)) main(int argc, int** argv)
    {
      // single-threaded programs override this function.
      printstr("Implement main(), foo!\n");
      return -1;
    }
    ...
    // always init all threads
    void _init(int cid, int nc)
    {
      volatile static uint32_t finish_sync0 = 0;
      volatile static uint32_t finish_sync1 = 0;

      //char num[2] = {cid, nc};
      //char *argv[1] = {num};
      int num[2] = {cid, nc};
      int *argv[1] = {num};
      int ret = main(2, argv);

      ATOMIC_OP(finish_sync0, 1, add, w);
      //asm volatile ( " amoadd.w zero, %1, %0" : "+A" (finish_sync0) : "r" (1) : "memory");
      while(finish_sync0 != nc);

      // synchronize for debug output below
      while(finish_sync1 != cid);

      char buf[NUM_COUNTERS * 32] __attribute__((aligned(64)));
      char* pbuf = buf;
      for (int i = 0; i < NUM_COUNTERS; i++)
        if (counters[i])
          pbuf += sprintf(pbuf, "core %d: %s = %d\n", cid, counter_names[i], counters[i]);
      if (pbuf != buf)
        printstr(buf);

      ATOMIC_OP(finish_sync1, 1, add, w);
      //asm volatile ( " amoadd.w zero, %1, %0" : "+A" (finish_sync1) : "r" (1) : "memory");

      exit(ret);
    ...

  2. hello_world_many.c
    int main(int argc, int** argv) {

      // synchronization variable
      volatile static uint32_t amo_cnt = 0;

      // synchronize with other cores and wait until it is this core's turn
      while(argv[0][0] != amo_cnt);

      // assemble number and print
      printf("Hello world, this is hart %d of %d harts!\n", argv[0][0], argv[0][1]);

      // increment atomic counter
      ATOMIC_OP(amo_cnt, 1, add, w);

      return 0;
    }
    ...
These changes work fine up to 192 cores, but they do not work above 192 cores.

Have you tried running the hello world token test correctly with 128 cores?
---> Yes, I have tried it and it works well. The output in fake_uart.log is below:
0 10
1 10
2 10
3 10
...

Furthermore, I have tested 192 cores with hello_world_token.c, and that works well too. I have also tried 208 cores, which does not work; it looks like the fetched instruction is incorrect.

Thanks!

@guillemlp

Which NoC sizes are you playing with?
I have only tried 128/256/512/1024 cores.
Can you try 256, e.g. a 16x16 NoC?

@zhb9103
Author

zhb9103 commented Oct 7, 2024

Hi @guillemlp:

I have tried the 16x16 core configuration, and it doesn't work. My steps are below:

  1. SOC building command
    sims -sys=manycore -x_tiles=16 -y_tiles=16 -vcs_build -ariane -config_rtl=MINIMAL_MONITORING

  2. SOC simulation command
    sims -sys=manycore -vcs_run -x_tiles=16 -y_tiles=16 hello_world_token.c -ariane -config_rtl=MINIMAL_MONITORING -finish_mask 0x1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 -rtl_timeout 1000000000000

After a while, I can see the following in the trace_hart_*.log files:
Exception @ 34127500, PC: 000000fff1010000, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 37715500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 39707500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 43025500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
...

It is a little difficult to find the root cause.

I will git clone the MPI project and run the test there (256 cores or above). I think I might have missed something.

Thanks!

@zhb9103
Author

zhb9103 commented Oct 7, 2024

Hi @guillemlp:

I have done the steps below on the MPI project:

  1. sims -sys=manycore -x_tiles=16 -y_tiles=16 -vcs_build -ariane -config_rtl=MINIMAL_MONITORING
  2. sims -sys=manycore -vcs_run -x_tiles=16 -y_tiles=16 hello_world_token.c -ariane -config_rtl=MINIMAL_MONITORING -finish_mask 0x1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 -rtl_timeout 1000000000000

But I see the following in the trace_hart_*.log files:
Exception @ 34127500, PC: 000000fff1010000, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 37715500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 39707500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
Exception @ 43025500, PC: 000000fff1010040, Cause: Illegal Instruction,
tval: 0000000000000000
...

I don't know what I did wrong. Could you help me check?

Thanks!
