
Need to add timeout #90

Open
tornikeo opened this issue Nov 17, 2024 · 7 comments

Comments

@tornikeo
Contributor

I need an option for a per-process timeout in pqcxms.

The last 2 of the 350 production runs on the tetrahydrofuran example take more than half the time the first 348 took. We need a timeout argument here.

This could be as simple as adding timeout $timeout somewhere in pqcxms.
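For reference, coreutils `timeout` makes this kind of wrapping straightforward: it kills the command when the limit is hit and returns exit status 124, which the caller can check. A minimal sketch, with `sleep` standing in for the per-trajectory qcxms call (the qcxms invocation itself is only illustrative here):

```shell
#!/usr/bin/env bash
# Wrap a long-running command with coreutils timeout.
# 'sleep 5' is a stand-in for the per-trajectory qcxms call.
timeout 2 sleep 5
status=$?
if [ "$status" -eq 124 ]; then
  echo "trajectory timed out"
fi
```

Exit status 124 is specific to the time limit being reached; a command that fails on its own passes its own exit status through, so the two cases can be logged separately.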

@tobigithub

pqcxms is just the bash script; on a multi-CPU computer this would be useful. But on a cluster, pqcxms would not be used, so if possible this should be implemented at a lower level. Clusters also have timeout controls (Slurm default job time or time limits, same for many other schedulers). So it would be interesting to discuss at which level to implement this.
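To illustrate the scheduler-level option: on a Slurm cluster the wall-clock limit can be enforced by the scheduler itself via the standard `--time` directive, with no change to qcxms at all. A hypothetical batch file (job name and resource numbers are assumptions, not from this repo) might look like:

```shell
#!/usr/bin/env bash
#SBATCH --job-name=qcxms-prod    # hypothetical job name
#SBATCH --time=01:00:00          # hard wall-clock limit: Slurm kills the job after 1 h
#SBATCH --cpus-per-task=2

# Match OMP threads to the allocated cores; qcxms is assumed to be on $PATH.
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK qcxms --prod > qcxms.out 2>&1
```

The trade-off versus a timeout inside pqcxms is granularity: `--time` limits the whole job, while a per-trajectory timeout needs to live in the script (or program) that launches each trajectory.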

@tobigithub

Maybe relevant: https://www.baeldung.com/linux/bash-timeouts (just a random link explaining timeouts in bash)

@tornikeo
Contributor Author

Thanks @tobigithub. I have just learned that the authors are working on a completely different method for simulating MS/MS (https://www.stc2024.de/boa.pdf page 131). I also found a paper detailing poor qcxms performance (https://chemrxiv.org/engage/chemrxiv/article-details/65fd8fc09138d2316107fbb6). I doubt this would get implemented (or even be relevant) in the near future.

@JayTheDog
Member

It depends on what your aim is. First of all, the performance of the new method has to be validated. While it is based on an entirely new approach, there will probably also be issues with "plug-and-play" kinds of implementations.

Second, the "poor performance" of the program has to be put into perspective. The authors of this paper clearly disregard that experimental references in MS are often just as poorly reproducible (depending on the machine used for the reference), so settings have to be optimized for the machine the references were created on (Orbitrap GC-MS/MS), which was clearly not done.
Furthermore, the lack of understanding of the workings of QC methods (e.g. using RDKit for conformer search, optimizing geometries with DFT at the 6-31G level and then using GFN1-xTB for MDs, cutting low-intensity peaks (which is illogical for the QCxMS approach), comparing against Orbitrap GC-MS/MS results, see above, just to name a few) shows that these authors are not entirely aware of what they are actually doing. So saying that QCxMS has "poor performance" in general is pretty short-sighted.
In all the papers published about QCxMS (QCEIMS), a lot of issues with the approach were already addressed, and how the settings of the program can influence the outcome should be considered when using it. It's not a high-throughput method that can be used blindly, without some adjustments. We never claimed this in the first place.

If a real evaluation is not done, I can't understand how someone (@tornikeo) comes along to this repo, asks for help with this free-to-use software, and then speaks of "not implementing" this in their software anyway. This underlines that AI people really don't understand physics-based approaches and are just out for quick solutions (which running QCxMS is not).

Nobody asked you to implement this approach into your code, and with this kind of attitude, I doubt that the authors will be helping you with any issues moving on forward (also for the next level of the program).

@tornikeo
Contributor Author

I apologize for my earlier comments, @JayTheDog. I didn’t mean to disrespect your work. I’m amazed that such an approach is even conceivable, let alone made available for free. I’m learning from it to apply to my own work. No disrespect was intended.

That said, with all due respect, the "the AI people" remark does come across as disrespectful. This is the first time I’ve heard someone use that as an insult. For the record, my work with this repository has nothing to do with AI. Assigning simple labels like “poor performance” or “the AI people” seems to be human nature, I suppose.

Your comments about that paper are intriguing. I’ll do my best to see if I can still salvage my master’s thesis from this. FWIW, I’ve run the default CID example on Oxane from the repository, and unfortunately, the results with all defaults aren’t looking great. Additionally, there’s a small PR (#88) awaiting your approval—it addresses a trivial memory bug.

I had hoped we’d meet under better circumstances, @JayTheDog. I actually have an email drafted for you, for your opinion on my master’s thesis. If done properly, the thesis would've ported some critical parts of MD simulation to a GPU, potentially accelerating the method. I’m unsure if this is feasible, and that’s what I wanted to ask.

Let me know if you’re open to providing some advice on this. If not, I understand and wish you all the best.

@JayTheDog
Member

Okay, thanks for setting it into perspective and for the apology.

Tbh, it's not the first time that there has been a misconception about the work, so forgive me for snapping. It's just that people often blindly use this tool without understanding the mechanics behind it and then complain that it's not what they expected it to be. And unfortunately it's often the same community that has this issue and misunderstands the complexity behind it. As your profile description points in the same direction, @tornikeo, I felt the need to set the record straight, but I clearly overstepped. So my apologies as well.

Finding a default setting that works for most structures is actually not really possible, as there is more going on in the experiment than even experimentalists often expect (I know that from years of intense conversations and feedback from companies and producers of those machines). So for your calculation of Oxane, you probably have to increase the impact energy.
Overall I'm not sure what level of accuracy you are interested in or what the overall topic is, but with this program you probably won't get overwhelming agreement with the experiment, for many reasons. So you can send me your questions and maybe I can help shed some light, but if you need this for your master's thesis, I'm not quite sure you would be able to wait for QCxMS2, as it is not yet clear when it will hit the market.

@tobigithub

We have run thousands of QCEIMS experiments (the predecessor) and several thousand QCxMS calculations and have published a number of papers on it [1, 2, 3, 4, 5], but yeah, mostly on workstations (32 to 128 CPU cores) and the majority on HPC with 400-1200 CPU cores. Because it is ab initio MD (AIMD), the workload is quite substantial and not feasible on a general PC, except for small (<300 Da) molecules.

Even with marginally faster AIMD methods (semiempirical and DFT level) you still would need hundreds of CPU cores or 50-100 GPUs to crunch through a larger number of molecules (n>1000). Right now QCxMS and other tools are more fitted for deep-dive investigations of a few molecules, basically investigating individual trajectories and reaction pathways. But we know from many simulations and comparisons against experimental mass spectra (EI and CID), that there are processes that we do not understand yet, can not explain yet, and can not simulate (yet) even with advanced tools like QCxMS, which I consider the gold standard in ab-initio generation of EI and CID mass spectra.

So yes, QCxMS2 or "Quantum chemical calculation of mass spectra via automated transition state search" from @gorges97 will be an exciting step forward, but I assume that individual transition state calculations across many potential energy surfaces (PES) will still be computationally expensive, because it still requires stochastic sampling. But I could be mistaken. The new approach opens up interesting new horizons, like hybrids of TS search and AIMD. But such quantum chemistry methods will never be faster than machine learning (ML) based methods, as Jeroen pointed out; plus, ML methods always need training spectra, so no "ab initio" and no "first principles" by design.

Very fast GPU-based DFT codes from tools like TeraChem would probably improve speed massively, because most CPU-based DFT methods are extremely slow compared to novel GPU-based implementations (blanket statement, unless I see new benchmarks). There are many other tools, like Promethium by QC Ware and GPU4PySCF [Ref]. But interesting times lie ahead.

Regarding your timeout question, here is a modified pqcxms-timeout bash script. It runs 4 jobs with 2 threads each (basically an 8-core CPU with 16 threads), and the timeout is set to one minute (-m 1). You can create a new script called "pqcxms-timeout", use "chmod +x pqcxms-timeout" to make it executable, and run it. Every single trajectory will then be interrupted after the timeout. The timeout value should be based on benchmarks; otherwise no trajectory will ever finish. If no timeout parameter is set, it just runs under standard conditions with no timeout.

pqcxms-timeout  -j 4 -t 2 -m 1
Starting 4 qcxms runs, each using 2 threads.
Timeout set to 1 minutes.

The change in the pqcxms-timeout bash script is in the actual subprocess call, which is extended by putting timeout "$((timeout_min * 60))" in front of it.

#!/usr/bin/env bash
# The pqcxms script is used to do parallel computations on a local machine.
# Use either the maximum number of possible threads <nproc> or define the 
# number with 'pqcxms <$1>' and timeout parameter in minutes.

TMPDIR=TMPQCXMS
if [ ! -d "$TMPDIR" ]; then
    echo "No TMPQCXMS folder! -> run qcxms first!"
    exit 1
fi

usage()
{
  echo " (local) Parallel-QCxMS run-script"
  echo 
  echo " usage: pqcxms-timeout "
  echo "                -j <number of Jobs> "
  echo "                -t <number of OMP_threads> "
  echo "                -m <timeout in minutes> "
  echo ">> If nothing is defined, maximum number of <nproc> is used. <<" 
  echo
  echo " Run QCxMS twice before production runs can start. "
}

#########################################################################
# Balance to get nproc == njob * nthread
nthread=1
njob=$(nproc | awk "{print int(\$1/$nthread)}")
qcxms=$(which qcxms)
timeout_min=0  # Default: no timeout
#########################################################################

# Set options
while [ "$1" != "" ]; do

  PARAM=$(echo "$1" | awk  '{print $1}')
  VALUE=$(echo "$2" | awk  '{print $1}')

  case $PARAM in

    -h | --help)
      usage
      exit
      ;;

    # set number of jobs
    -j | --jobs)
        if [ "$VALUE" -le "$njob" ]; then
          njob="$VALUE"
        elif [ "$VALUE" -gt "$njob" ]; then
          echo "Cannot exceed max number of threads. Reduced to $njob parallel calculations."
          echo
        else
          echo "Provide number of jobs you want to run - E X I T "
          echo
          exit
        fi
        ;;

    # set number of threads (OMP)
    -t | --threads)
      nthread="$VALUE"
      njob=$(nproc | awk "{print int(\$1/$nthread)}")
      ;;

    # set timeout in minutes
    -m | --minutes)
      timeout_min="$VALUE"
      ;;

    *)
      echo "Invalid option - see below "
      usage
      exit
      ;;
  esac

  shift 
  shift 

done

echo 
echo "Starting $njob qcxms runs, each using $nthread threads."
echo
if [ "$timeout_min" -gt 0 ]; then
  echo "Timeout set to $timeout_min minutes per run."
fi

# Remove old files to avoid confusion if job is finished
rm -f qcxms.out 2> /dev/null
rm -f qcxms.res 2> /dev/null
rm -f $TMPDIR/*/qcxms.out 2> /dev/null
rm -f $TMPDIR/*/qcxms.res 2> /dev/null

# Create the wrapper script with timeout support
cat > wrapped_qcxms <<-EOF
#!/usr/bin/env bash
if [ -d "\$1" ]; then
  cd "\$1" > /dev/null 2>&1
  if [ "$timeout_min" -gt 0 ]; then
    # Apply timeout to the individual subprocess
    timeout "$((timeout_min * 60))" bash -c "OMP_NUM_THREADS=$nthread exec \"$qcxms\" --prod > qcxms.out 2>&1"
    if [ \$? -eq 124 ]; then
      echo "Timeout reached for \$1" >> ../qcxms_timeout.log
    fi
  else
    # No timeout if not specified
    OMP_NUM_THREADS=$nthread exec "$qcxms" --prod > qcxms.out 2>&1
  fi
fi
EOF

chmod +x wrapped_qcxms
wrapper=$(realpath wrapped_qcxms)

# Run the parallel jobs, ensuring the global script continues despite subprocess failures
printf "%s\0" $TMPDIR/*/ | xargs -n 1 -P $njob -0 "$wrapper" || true

rm "$wrapper"

# Combine output files
cat $TMPDIR/*/qcxms.out >> qcxms.out || exit 1
cat $TMPDIR/*/qcxms_cid.res >> qcxms_cid.res 2> /dev/null || rm qcxms_cid.res
cat $TMPDIR/*/qcxms.res >> qcxms.res 2> /dev/null || rm qcxms.res

Be aware that this will produce a "process killed (SIGTERM)" error from qcxms; that is what you see in the qcxms.out file below. But overall it works fine (tested on Ubuntu); you might want to adjust the qcxms.in parameters, like the number of trajectories or the number of collisions, accordingly. Also, do not oversubscribe the number of CPU cores, or everything will be slower. In this case the computation was stopped after one minute for each of the trajectories. Note that timeout has several options, e.g. "timeout --signal=SIGINT 10" and so on.
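To expand on those signal options: by default `timeout` sends SIGTERM (which is why qcxms reports a SIGTERM), `--signal` chooses a different signal, and `--kill-after` adds a SIGKILL fallback for processes that ignore the first signal. A small demonstration, with `sleep` again standing in for qcxms:

```shell
#!/usr/bin/env bash
# Send SIGINT instead of the default SIGTERM after 1 second;
# if the process is still alive 5 seconds after that, SIGKILL it.
timeout --signal=SIGINT --kill-after=5 1 sleep 10
echo "exit status: $?"   # 124 indicates the time limit was reached
```

Choosing SIGINT or SIGTERM gives the program a chance to clean up before exiting; only fall back to SIGKILL (exit status 137) when a trajectory genuinely hangs.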


 ============================
 Starting Collision Dynamics:
 ============================

 initializing GFN2-xTB ...
 initialization successful!
 energy =      -51.66014
 charge =              1
 mult   =              1
 etemp  =         5000.0
 initializing S=0 state using HS-UHF.

 Est. no. steps (collision):  780    Distance (A):    58.846939

   step  time [fs]    Epot       Ekin       Etot      eTemp    Avg. Temp.
    100    470.    -1405.8881   68.2634    16.5979     5000.     765.160
    200    520.    -1405.2036   67.8019    16.1616     5000.     743.999
    300    570.    -1406.3645   68.8777    17.1948     5000.     738.807
    400    620.    -1405.7204   68.1689    16.5097     5000.     736.191
    500    670.    -1406.0461   68.5467    16.8754     5000.     729.177
    600    720.    -1406.2292   68.8433    17.1653     5000.     726.417
    700    770.    -1403.5336   65.9731    14.3941     5000.     736.135
    800    820.    -1403.3692   65.9548    14.3819     5000.     825.111

COLLISION after    800 steps
STOP      after   1650 steps

    900    870.    -1403.1506   65.7261    14.1613     5000.    1484.228
   1000    920.    -1403.4045   66.1726    14.5985     5000.    1443.843
   1100    970.    -1403.3446   66.0369    14.4649     5000.    1425.634
   1200   1020.    -1403.5591   66.0406    14.4607     5000.    1424.063
   1300   1070.    -1403.7099   66.0643    14.4789     5000.    1433.128
   1400   1120.    -1404.5953   67.3094    15.6915     5000.    1436.552
   1500   1170.    -1403.1457   65.6559    14.0912     5000.    1436.994
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
qcxms              000000000041F23B  Unknown               Unknown  Unknown
qcxms              0000000001FB1090  Unknown               Unknown  Unknown
qcxms              00000000005DFB58  tblite_wavefuncti         116  type.f90
qcxms              0000000001ED5133  Unknown               Unknown  Unknown
qcxms              0000000001E8DE6E  Unknown               Unknown  Unknown
qcxms              0000000001E8F089  Unknown               Unknown  Unknown
qcxms              0000000001E59958  Unknown               Unknown  Unknown
qcxms              00000000005DF8AA  tblite_wavefuncti         112  type.f90
qcxms              0000000000684F2E  tblite_scf_iterat         318  iterator.f90
qcxms              0000000000680195  tblite_scf_iterat         105  iterator.f90
qcxms              0000000000617E6E  tblite_xtb_single         204  singlepoint.f90
qcxms              0000000000514022  qcxms_tblite_mp_g         136  tblite.f90
qcxms              00000000004FB654  qcxms_iniqm_mp_eg         644  iniqm.f90
qcxms              00000000004A8C6D  qcxms_cid_routine         761  cid.f90
qcxms              000000000040EE80  MAIN__                   1646  main.F90
qcxms              0000000000400E72  Unknown               Unknown  Unknown
qcxms              0000000001FB2F69  Unknown               Unknown  Unknown
qcxms              0000000000400D4A  Unknown               Unknown  Unknown
