I was able to get the answer to "What is the capital of the United States of America?", but it took almost 2 hours. I was using a 13900K with 192 GB of RAM and 8 TB of RAID-striped NVMe. I also ran it on a dual-Xeon machine with 512 GB of RAM and dual 3090s with NVLink; both got about the same performance. I noticed the GPUs are not really being utilized and the run is single-threaded and CPU-bound.

I updated the code to read the safetensors files into memory buffers and then keep feeding those buffers to the safetensors load function rather than load_file. I believe part of the bottleneck is loading the model into the GPU so often. I would suggest allowing some of those loaded tensors to stay on the GPU, and also using multiple GPUs. I'd like to see the memory buffers loaded into shared memory and used across one process per GPU per server. This also needs to scale horizontally across multiple servers. I have a total of 96 GB of VRAM across 4 fairly high-end GPUs (performance-wise) and would like to see this technique applied together with the parallel-model approaches I've seen. Have you started doing any of that work lately?
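Roughly what I mean by the buffer change, as a minimal sketch (the shard file names and the helper function are just illustrative, not the repo's actual code): read each shard from disk once into RAM, then deserialize from the in-memory bytes with safetensors.torch.load() instead of re-reading from disk with load_file() on every pass.

```python
# Sketch: cache safetensors shards as bytes in RAM, deserialize on demand.
# Assumes there is enough system RAM to hold all shards at once.
import glob
import torch
from safetensors.torch import load  # deserializes tensors from a bytes buffer

# Read every shard into memory once (file pattern is a placeholder).
buffers = {path: open(path, "rb").read() for path in glob.glob("model-*.safetensors")}

def tensors_for(path: str, device: str = "cuda:0") -> dict[str, torch.Tensor]:
    """Deserialize a cached shard and move its tensors to the target GPU."""
    cpu_tensors = load(buffers[path])  # bytes -> dict of CPU tensors, no disk I/O
    return {name: t.to(device, non_blocking=True) for name, t in cpu_tensors.items()}
```

The next step I'm suggesting would be to keep the hot shards resident on each GPU (and eventually share the byte buffers across one worker process per GPU) instead of re-uploading them every time.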