Stata 17 crashes when using iebaltab with 4.3GB dataset #368

Open
paulasanematsu opened this issue Nov 7, 2024 · 9 comments
@paulasanematsu commented Nov 7, 2024

Hello,

I am a Research Computing Facilitator at FASRC. Raul Duarte reached out to our support because a Stata do-file he was running with the iebaltab command on our cluster kept dying midway through computation. We troubleshot extensively without much progress, so we are reaching out to you for guidance. I will try to summarize the computational environment and what we have done so far.

Unfortunately, Raul's data cannot be shared because of a signed Data Use Agreement (DUA), but we will try to explain as much as possible.

Computational environment

  • OS: Rocky Linux 8.9
  • Hardware (for more details, see https://docs.rc.fas.harvard.edu/kb/fasse/#SLURM_and_Partitions):
    • fasse_bigmem partition: Intel Ice Lake chipset, 499 GB of RAM, /tmp space is 172 GB
    • fasse_ultramem partition: Intel Ice Lake chipset, 2000 GB of RAM, /tmp space is 396 GB
  • Stata: version 17.0 with MP (64 cores)

Analysis

Raul wrote a do-file that uses the iebaltab command to analyze a dataset that is 4.3 GB:

iebaltab median_hs6_unit_price median_hs6_cifdoldecla median_hs6_imponiblegs unit_price_final cifdoldecla imponiblegs, replace grpvar(val_count_patronage_hire) fixedeffect(port_day_ID) ///
	savetex("$DirOutFasse\baltab_val_shipment_item_values_counter_day.tex") ///
	grplabels(0 "Non-patronage" @ 1 "Patronage")  format(%12.0fc) order(1 0) ///
	rowlabels(median_hs6_unit_price "Median HS6 unit price (in USD)" @ median_hs6_cifdoldecla "Median HS6 CIF value (in USD)" ///
		@ median_hs6_imponiblegs "Median HS6 tax base (in PYG)" @ unit_price_final "Unit price (in USD)" ///
		@ cifdoldecla "Declared CIF value (in USD)" @ imponiblegs "Tax base (in PYG)") nonote

Raul wrote:

This line uses iebaltab to create a balance table. My dataset is a database of imports, and for the balance-table tests of differences between two groups (patronage and non-patronage) handling shipment items, I want to include port-day fixed effects. Since I have 5 years of data and 31 customs ports, this could lead to more than 56,000 fixed effects, which seems to be what is causing the problem, as the balance table does run without the fixed effects.

His typical run was on fasse_bigmem (499 GB of RAM and 64 cores).

Troubleshooting steps

  1. On the Stata GUI, Raul tried the following:
    1. To rule out out-of-memory errors, he tested the do-file on our node with 2000 GB of RAM and 64 cores and still ran into the same problem.
    2. Successfully ran the do-file with iebaltab on a subset of his original dataset (a 5% random sample).
    3. Checked that he is not exceeding any of the Stata settings.
    4. Set max_memory to slightly less than the total memory: 495 GB when the memory requested on fasse_bigmem was 499 GB (see the sketch after this list).
    5. Tried to run with Stata/SE using a single core, but Stata threw an error that it could not handle that many variables in the SE version.
    6. I suggested using debugging mode (https://www.stata.com/support/faqs/programming/debugging-program/), but that did not provide more useful information about the error.
  2. On the command line, I submitted a job via the scheduler to run the same do-file on the original dataset:
    1. While the job was running, I used top to watch CPU and memory usage, and I kept checking the disk usage of /tmp with the du command. Core usage was almost 100% on all 64 cores, memory was at about 5-6% (of 499 GB), and /tmp held about 4-5 GB. At about the one-hour mark, I could see each process die, and everything stalled.
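In Stata, the settings from steps 3-4 can be inspected and adjusted roughly like this (a minimal sketch; 495g mirrors the value Raul used and is not otherwise special):

query memory                 // show current memory settings, including max_memory
set max_memory 495g          // cap Stata slightly below the 499 GB requested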

We are hoping you can offer some guidance on whether Raul possibly ran into a bug or whether there is something on our end that we need to change.

Thank you for taking the time to read this. We will be happy to answer any questions.

Best,
Paula and Raul

@kbjarkefur (Contributor) commented

Wow, you are really putting our code to the test. Fun!

Here are my first reactions to what you have already tested:

  • It does not seem to be an issue with memory. That is good, although a memory problem would usually be the kind of issue that can be solved by improving the code. The command uses temp files to store intermediate results, and I paid a lot of attention to ensuring that temp files that are no longer relevant are deleted.
  • The high dimensionality of this estimation (stemming from 56,000 fixed effects on a large number of observations) would put a significant load on the CPU, and you are saying that the CPU is at 100%. This observation makes sense, but it does not explain why the process would stop: the CPU should be able to work through the long queue of tasks and process them as capacity becomes available.

Questions:

  • Do you have a GPU-enabled cluster? GPUs are better suited to handling very large throughputs of computational tasks. However, iebaltab does not implement any GPU support beyond what Stata's built-in commands provide, so it is hard for me to say how much of a difference this would make.
  • I do not think this is likely to be an issue on FASRC's cluster, but I'm still curious about what we can learn regarding timeouts or other constraints.
    • You say the processes die after ~1 hour. How close to exactly 1 hour is it? And is that time consistent regardless of the workload? What if you were to randomize a sub-sample (perhaps 80% or 50%)? Would the process fail at a very similar point in time? That would suggest an issue with some timeout.
    • This question pushes my understanding of CPUs, but all CPUs have advanced task managers. (In GPUs this is much simpler, but GPUs cannot handle all types of tasks; matrix multiplication in regression estimation, however, is something GPUs handle very well.) Since the task manager needs to manage an extremely large queue, could it somehow run out of capacity? I do not think this computation is the largest FASRC has ever seen, but perhaps it is the largest one involving Stata? There might be something happening at the intersection of Stata and the task manager, especially with Stata on Linux, which is the least-used version of Stata and therefore the most likely to have a bug or a rare unhandled corner case.

Suggestions:

  • In most modern operating systems, Stata's memory setting can be configured with set max_memory ., which allows Stata to manage memory dynamically as needed. While this is unlikely to be the cause of the issue, since it does not seem memory-related, it is good to be aware of this setting.
  • I understand that subsetting the observations is not a valid approach, as it generates different results. But does the command work if you run one balance variable at a time? The way you specify the command, each balance variable is analyzed independently, so running one at a time would still give you the same results. The only drawback is that you would need to combine the outputted LaTeX files after all estimations are completed. Not optimal, but if it works, it is likely to be your quickest solution to this issue (see the sketch after this list).
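A minimal sketch of that loop, reusing the options from the original call (rowlabels() dropped for brevity; the per-variable output file names are only illustrative):

local balancevars median_hs6_unit_price median_hs6_cifdoldecla median_hs6_imponiblegs ///
	unit_price_final cifdoldecla imponiblegs

foreach var of local balancevars {
	* Each balance variable is analyzed independently, so one run per
	* variable gives the same estimates as the joint call
	iebaltab `var', replace grpvar(val_count_patronage_hire) ///
		fixedeffect(port_day_ID) ///
		savetex("$DirOutFasse\baltab_`var'.tex") ///
		grplabels(0 "Non-patronage" @ 1 "Patronage") format(%12.0fc) ///
		order(1 0) nonote
}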

Let me know what these comments make you think or what these suggestions teach you. Happy to keep working with you until this is resolved. However, the issue might also lie in Stata itself (especially on Linux), in which case I would not be able to help with a solution.

@paulasanematsu (Author) commented

Glad to hear this is a "fun" problem!

Answering your questions:

  • We have GPU nodes! These machines have Intel Ice Lake CPUs with 4 NVIDIA A100 GPU cards, 487 GB of RAM, and 172 GB of /tmp space. They are the nodes under the fasse_gpu partition in our documentation.
  • When I observed Raul's job, it did not die at exactly 1 hour; I would say 57 or 58 minutes. But since Kristoffer brought up a possible timeout issue, we should test to make sure it can run beyond 1 hour. If such a timeout exists, it would be a Stata limitation, not a cluster limitation. We use the Slurm scheduler, and Raul's jobs all had wall times ranging from 24 to 48 hours. Although the Stata computations die, the Slurm job itself remains active: for the job that I ran on his behalf, I saw CPU usage going down and eventually stopping, but his Slurm job was still up and its resources were still available to him.
  • Some of our users run very large computations with 1000+ cores. However, Raul's analysis is likely our largest Stata job, and unfortunately we don't have any staff with Stata expertise. I will reach out to my colleagues and do some research to see if I can find anything regarding the intersection of Stata and the task manager.

Based on Kristoffer's thoughts and suggestions, I have a few suggestions for Raul so we can better understand what is happening:

  1. For all runs below, set max_memory a little lower than the requested memory (e.g., if you request 200 GB, set max_memory to 190 GB). Although we don't have indications of a memory issue, I think this is safer than no setting at all.

  2. Rerun a do-file that uses a 5% random sample. I would like to test the 1-hour-timeout hypothesis. You ran this before, but I would like to confirm that it ran beyond the 1-hour mark. If Stata allows, can you print out the date and time before and after the iebaltab call so we know how long that particular command ran? (A timing sketch follows after this list.) If the 5% sample runs in less than 1 hour, then increase the sample size.

  3. Run the original do-file using the fasse_gpu partition (i.e., a GPU-enabled computer). To use these nodes, when you request a Stata session, request the fasse_gpu partition and set "Number of GPUs" to 1. I am not sure if Stata needs extra settings to run on a GPU or if it works out of the box. You can check whether the GPU card is being used by opening a terminal (Applications in the top-left corner -> Terminal Emulator) and executing the command nvtop. If the GPU is being used, you will see a graph with GPU % and GPU memory % in use.
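For item 2, a minimal sketch of the timing I have in mind, using Stata's timer command and c() date/time values (the iebaltab line stands in for the full command from the original do-file):

display "Start: `c(current_date)' `c(current_time)'"
timer clear 1
timer on 1
iebaltab ...    // full command from the original do-file goes here
timer off 1
timer list 1    // elapsed seconds for the iebaltab call
display "End:   `c(current_date)' `c(current_time)'"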

Raul, if you prefer to prepare a do-file for #2, I will be happy to run it and observe CPU and memory usage while it runs.

Does Stata have a built-in profiler that shows how much time and memory the code spends in each function? If so, it would be worth using it in these additional tests.

@reduarte commented Nov 23, 2024 via email

@reduarte commented Nov 24, 2024 via email

@reduarte commented
I just noticed that my last two emails did not get posted here, so I am copying them below:

OK, #2 actually just finished running, and here's the output:

. timer list
   1:  30335.58 /        1 =   30335.5760

This is supposed to be the time in seconds, so it took approximately 30,336 seconds, which is around 8.4 hours if I am calculating this correctly.

--

Also, I looked into #3, and it seems that Stata does not natively support GPU acceleration for its computations, so I am not sure whether there is a way to proceed with suggestion #3. Oh, and I had implemented suggestion #1 when I ran #2, by the way.

@kbjarkefur (Contributor) commented

I am happy to keep supporting you, especially if we discover that the issue is related to iebaltab, or that something in iebaltab can be improved to better handle large computations. However, since the command runs on 5% of the sample and no error message is thrown by the command, I am not sure where my experience is useful.

The Stata List might be a good forum for this question: https://www.statalist.org/. StataCorp's own technical staff monitor that forum and often step in to help answer questions that no community user has been able to answer. If you still do not get an answer, Harvard should have someone who handles Harvard's Stata licenses and has a counterpart at StataCorp who can flag the question to their technical staff. My experience is that StataCorp is very helpful if you can show that you first tried to get an answer from the community.

You can reference this discussion in your Stata List post, but I would include a summary of the most relevant information in the post itself, as many users there do not use GitHub and are more likely to make an effort to answer the question if you post more than a link to this page.

If we learn something that traces back to the iebaltab command specifically, then I am happy to help implement the fix.

@paulasanematsu (Author) commented

Thank you @reduarte for following up on the suggestions.

As @kbjarkefur suggested, I posted on the Stata forum: https://www.statalist.org/forums/forum/general-stata-discussion/general/1768357-stata-17-crashes-during-large-computation.

@paulasanematsu (Author) commented

@kbjarkefur Thanks for advising us to post on the Stata forum. They replied shortly after I posted, suggesting a change in the iebaltab code: to use areg instead of regress. I am not familiar with Stata, so I am not sure how much work this entails.

@kbjarkefur (Contributor) commented Nov 28, 2024

@paulasanematsu - we considered areg, as it is faster than regress, but areg does not support all the things that we wanted iebaltab to support, so we went with regress. (While I recall us having this discussion, I do not remember off the top of my head what those features were, as this was several years ago.)

Since a regression with ~50K fixed effects is an edge case (relative to typical Stata usage), it probably does not justify removing the other iebaltab features that a switch to areg would require.

Would updating your environment to Stata 18, as suggested in the Stata List reply, be an option?

Another solution is to code this yourself for this edge case. None of the individual regressions in iebaltab is complex, and the regressions used are documented in the helpfile.

  • The main benefit of the command is that it simplifies the specification of these regressions and has options that match the terminology of balance tables. It should be quite straightforward to find the relevant regressions in the helpfile given @reduarte's specification, and then you can swap in areg and see if it supports that specification. The main difference is that with areg you put the fixed effects in the absorb() option instead of including them as independent variables (see the sketch after this list).
  • Another benefit is that iebaltab supports outputting the results to LaTeX or CSV files. However, if that is the only pain point, then packages such as estout can help with that.
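A minimal sketch of that difference, using one of @reduarte's balance variables as the outcome (an approximation for illustration; the exact specifications iebaltab runs are documented in the helpfile):

* regress estimates an explicit coefficient for every port-day dummy:
regress unit_price_final i.val_count_patronage_hire i.port_day_ID

* areg absorbs the ~56,000 fixed effects instead of estimating each one:
areg unit_price_final i.val_count_patronage_hire, absorb(port_day_ID)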
