Stata 17 crashes when using `iebaltab` with 4.3GB dataset #368
Comments
Wow, you are really putting our code to the test. Fun! Here are my first reactions to what you have already tested:
Questions:
Suggestions:
Let me know what these comments make you think or what these suggestions teach you. I am happy to keep working with you until this is resolved. However, it might also be related to Stata itself (especially on Linux), where I would not be able to help with a solution.
Glad to hear this is a "fun" problem! Answering your questions:

- We have GPU nodes! These machines have Intel Ice Lake with 4 A100 GPU cards, 487 GB of RAM, and 172 GB of /tmp space. They are the nodes under the fasse_gpu partition in our documentation: https://docs.rc.fas.harvard.edu/kb/fasse/#SLURM_and_Partitions.
- When I observed Raul's job, it did not die exactly at 1h; I would say 57 or 58 minutes. But since Kristoffer brought up a possible timeout issue, we should test to make sure it can run beyond 1h. If this timeout exists, it would be a Stata limitation, not a cluster limitation. We use the slurm scheduler (https://slurm.schedmd.com/documentation.html). Raul's jobs all had a wall time ranging from 24 to 48 hours. Although the Stata computations die, the slurm job itself remains active. For the job that I ran on his behalf, I saw CPU usage going down and eventually stopping, but his slurm job was still up and resources were available for him.
- Some of our users run very large computations with 1000+ cores. However, Raul's analysis is likely our largest Stata job, and unfortunately we don't have any staff with Stata expertise. I will reach out to my colleagues and do some research to see if I can find anything regarding the intersection of Stata and the task manager.

Based on Kristoffer's thoughts and suggestions, I have a few suggestions for Raul so we can better understand what is happening:

1. For all runs below, set max_memory to a little lower than the requested memory (e.g., if you request 200 GB, set max_memory to 190 GB). Although we don't have indications of a memory issue, I think this is safer than no setting at all.
2. Rerun a do-file that uses a 5% random sample. I would like to test the 1h timeout hypothesis. You ran this before, but I would like to confirm that it ran beyond the 1h limit. If Stata allows, can you print out the date and time before and after the iebaltab call so we know how long that particular function ran? If the 5% sample runs in less than 1h, then increase the sample subset. (A sketch of both of these steps is below.)
3. Run the original do-file using the fasse_gpu partition (i.e., a GPU-enabled computer). To use these, when you request a Stata session, you have to request the fasse_gpu partition and request 1 in the "Number of GPUs". I am not sure if Stata needs extra settings to run on a GPU or if it works out of the box. You can check whether the GPU card is being used by opening a terminal (Applications in the top left corner -> Terminal Emulator) and executing the command nvtop. If the GPU is being used, you will see a graph with GPU % and GPU memory % being used.

Raul, if you prefer to prepare a do-file for #2, I will be happy to run it and observe CPU and memory usage while it runs. Does Stata have a built-in profiler to show how much time and memory each function uses? If yes, it would be worth using a profiler in these additional tests.
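A minimal Stata sketch of suggestions #1 and #2 (the 190g value mirrors the example above; the filename and the iebaltab call are placeholders, since the actual do-file is not shown in this thread):

```stata
* Sketch only: values and placement are illustrative, not prescriptive.
set max_memory 190g                       // a bit below the 200 GB requested
use "analysis_data.dta", clear            // placeholder filename
sample 5                                  // keep a 5% random sample
display "before iebaltab: " c(current_date) " " c(current_time)
* <the actual iebaltab call goes here>
display "after iebaltab:  " c(current_date) " " c(current_time)
```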
Hi Paula,
Sounds good, I am running #2 now, starting off with 5%.
Unfortunately, Stata does not have a built-in profiler, at least to the best of my knowledge.
I can try #3 afterwards.
Best,
Raul
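One rough stand-in for a profiler's timing side, if useful: Stata's rmsg setting prints the elapsed time after every command, although it says nothing about memory. A minimal illustration on a built-in dataset:

```stata
sysuse auto, clear
set rmsg on               // prints "r; t=..." (elapsed time) after every command
summarize price
regress price mpg weight
set rmsg off
```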
Hi Paula,
This is just to confirm that the code with the 5% sample takes longer than 1 hour; it is still running. I will keep everyone posted on how long the iebaltab command takes with this sample though!
I am using the timer command in Stata to keep track of how long it takes (see the sketch below): https://www.stata.com/manuals/ptimer.pdf
Best,
Raul
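For reference, the usual timer pattern looks like this (a sketch; the iebaltab call is a placeholder):

```stata
timer clear 1
timer on 1
* <the iebaltab call on the 5% sample goes here>
timer off 1
timer list 1              // reports elapsed seconds for timer 1
```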
I just noticed my last two emails did not get posted here, so I am copying them below:

Ok, #2 just actually finished running, and here is the output of `timer list`. It is reported in seconds, so it took approximately 30,336 seconds, which is around 8.42 hours if I am calculating this correctly.

Also, I looked into #3, and it seems like Stata does not natively support GPU acceleration for its computations out of the box, so I am not sure whether there is a way I can proceed with suggestion #3?

Oh, and I had implemented suggestion #1 when I ran #2, by the way.
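As a quick check of the arithmetic (30,336 seconds divided by 3,600 seconds per hour), which can be verified directly in Stata:

```stata
display 30336/3600        // 8.4266667, i.e. roughly 8.4 hours
```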
I am happy to keep supporting, especially if we discover that the issue is related to `iebaltab`. Statalist might be a good forum for this question: https://www.statalist.org/. StataCorp's own technical staff monitor that forum and often step in and help answer questions that no community user has been able to answer. If you still do not get an answer, Harvard should have someone who handles Harvard's Stata licenses and has a counterpart at StataCorp who can help flag the question to their technical staff. My experience is that StataCorp is very helpful if you can show that you first tried to get an answer from the community.

You can reference this discussion in your Statalist post, but I would include a summary with the most relevant information in the post itself, as many users there do not use GitHub and are more likely to make an effort to answer the question if you post more than a link to this page. If we learn something that comes back to the specific `iebaltab` implementation, we can pick that up here again.
Thank you @reduarte for following up on the suggestions. As @kbjarkefur suggested, I posted on the Stata forum: https://www.statalist.org/forums/forum/general-stata-discussion/general/1768357-stata-17-crashes-during-large-computation.
@kbjarkefur Thanks for advising me to post on the Stata forum. They replied shortly after I posted, suggesting a change in the iebaltab function: using areg instead of regress. I am not familiar with Stata programming, so I am not sure how much work this entails.
@paulasanematsu - we considered `areg`. Since a regression with ~50K fixed effects is an edge case (in relation to typical Stata usage), it probably doesn't justify removing the other features. Would updating your environment to Stata 18, as suggested in the Statalist reply, be an option?

Another solution is to code this yourself to fit this edge case. No individual regressions in `iebaltab` are complicated; the command's value is mostly in automating them and formatting the output, so replicating just the part you need should be feasible.
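For readers unfamiliar with the two commands: `areg` absorbs one categorical variable instead of estimating a dummy for each level, which is what makes it attractive with ~50K fixed effects. A minimal sketch on a built-in dataset:

```stata
sysuse auto, clear
* Fixed effects entered as explicit dummies (what regress does):
regress price mpg i.rep78
* Same mpg coefficient, with rep78 absorbed instead of estimated:
areg price mpg, absorb(rep78)
```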
Hello,
I am a Research Computing Facilitator at FASRC. Raul Duarte reached out to our support because he was running Stata code that uses the `iebaltab` function on our cluster, and the job was dying midway through the computation. We troubleshot extensively without much progress, so we are reaching out to you for guidance. I will try to summarize the computational environment and what we have done so far. Unfortunately, because of a signed Data Use Agreement (DUA), we cannot share Raul's data, but we will try to explain as much as possible.
Computational environment

- `fasse_bigmem` partition: Intel Ice Lake chipset, 499 GB of RAM, `/tmp` space is 172 GB
- `fasse_ultramem` partition: Intel Ice Lake chipset, 2000 GB of RAM, `/tmp` space is 396 GB

Analysis
Raul wrote a do-file that uses the `iebaltab` function to analyze a dataset that is 4.3 GB. His typical run was on `fasse_bigmem` (499 GB of RAM and 64 cores).
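The exact call is not reproduced above. Purely as an illustration, a call consistent with the rest of this thread might look like the following sketch, where every name is a placeholder (not Raul's actual code) and `fixedeffect()` refers to the option of that name in `iebaltab`:

```stata
* Hypothetical sketch -- all names below are placeholders.
use "analysis_data.dta", clear         // the ~4.3 GB dataset
iebaltab covar1 covar2 covar3, ///
    grpvar(treatment) ///
    fixedeffect(strata_id)             // stand-in for the ~50K-level factor
```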
Troubleshooting steps

- Raul set `max_memory` to slightly less than the total memory: he set it to 495 GB when the memory requested on `fasse_bigmem` was 499 GB.
- I used `top` to see CPU and memory usage, and I also kept checking the disk usage of `/tmp` with the `du` command. The core usage was almost at 100% for all 64 cores, memory was at about 5-6% (of 499 GB), and `/tmp` had about 4-5 GB of usage. At about 1h, I could see each process dying and everything stalled.

I am hoping that you have some guidance on whether Raul possibly ran into a bug, or whether there is something on our end that we need to change.
Thank you for taking the time to read this. We will be happy to answer any questions.
Best,
Paula and Raul