I was able to get the answer to "What is the capital of the United States of America?", but it took almost 2 hours. I was using a 13900K with 192 GB of RAM and 8 TB of RAID-striped NVMe. I also ran it on a dual-Xeon machine with 512 GB of RAM and dual 3090s with NVLink; both got about the same performance. I noticed the GPUs are not really being utilized and the run is single-threaded and CPU-bound.

I updated the code to read the safetensors files into memory buffers and then keep feeding those buffers to the safetensors load function rather than load_file. I believe part of the bottleneck is loading the model into the GPU so often. I would suggest allowing some of those loaded tensors to stay on the GPU, and also using multiple GPUs. I'd like to see the memory buffers loaded into shared memory and used across one process per GPU per server. This also needs to scale horizontally across multiple servers. I have a total of 96 GB of VRAM across 4 fairly high-end GPUs (performance-wise) and would like to see this technique applied together with the parallel-model approaches I've seen. Have you started doing any of that work lately?
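Roughly what I mean by the buffer change, as a minimal sketch (the shard file names and the helper function are just illustrative, not the repo's actual code): read each shard from disk once into RAM, then deserialize from the in-memory bytes with safetensors.torch.load() instead of re-reading from disk with load_file() on every pass.

```python
# Sketch: cache safetensors shards as bytes in RAM, deserialize on demand.
# Assumes there is enough system RAM to hold all shards at once.
import glob
import torch
from safetensors.torch import load  # deserializes tensors from a bytes buffer

# Read every shard into memory once (file pattern is a placeholder).
buffers = {path: open(path, "rb").read() for path in glob.glob("model-*.safetensors")}

def tensors_for(path: str, device: str = "cuda:0") -> dict[str, torch.Tensor]:
    """Deserialize a cached shard and move its tensors to the target GPU."""
    cpu_tensors = load(buffers[path])  # bytes -> dict of CPU tensors, no disk I/O
    return {name: t.to(device, non_blocking=True) for name, t in cpu_tensors.items()}
```

The next step I'm suggesting would be to keep the hot shards resident on each GPU (and eventually share the byte buffers across one worker process per GPU) instead of re-uploading them every time.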