RELION tomo segmentation fault in Refine3D #1166

Open
dacolombo opened this issue Jul 18, 2024 · 1 comment

@dacolombo

Describe your problem
Running Refine3D on tomographic data returns a segmentation fault at the very beginning of the first iteration; this happens both with and without MPI.

This error is probably related in some way to the dataset, since it doesn't happen with other datasets I have tested, but I cannot find anything obviously wrong with the data itself.
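For what it's worth, one crude way to scan the input metadata for obvious numeric problems, assuming plain-text STAR files (untested sketch; file names as in the command below, plus any particle/tomogram STAR files the optimisation set references):

    # Whole-word, case-insensitive match so e.g. "information" does not hit "inf"
    grep -inwE 'nan|inf' bin2_ribosomes_2D_optimisation_set.star *.tomostar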

Environment:

  • OS: CentOS 8
  • MPI runtime: intel-oneapi-mpi v2021.4.0
  • RELION version: Relion-5.0-beta-2-commit-a0b145
  • Memory: 200GB
  • GPU: Tesla V100

Dataset:

  • Box size: 192
  • Pixel size: 2.2 Å/px
  • Number of particles: 2098
  • Description: ribosomes
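As a quick sanity check of these numbers against the --particle_diameter 380 used in the command below, the particle comfortably fits the box:

    # 380 Å particle diameter at 2.2 Å/px is well inside the 192 px box
    echo "scale=1; 380 / 2.2" | bc   # 172.7 px < 192 px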

Job options:

  • Type of job: Refine3D
  • Number of MPI processes: 1
  • Number of threads: 7
  • Full command:
    valgrind --track-origins=yes `which relion_refine` --nr_parts_sigma2noise 1000 --o Refine3D/test1/run --auto_refine --ios bin2_ribosomes_2D_optimisation_set.star --ref bin2_av_ribo.mrc --firstiter_cc --trust_ref_size --ini_high 40 --dont_combine_weights_via_disc --pool 30 --pad 2  --ctf --particle_diameter 380 --flatten_solvent --zero_mask --solvent_mask mask.mrc --solvent_correct_fsc  --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale  --j 7 --gpu  --pipeline_control Refine3D/test1/
    
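Since the backtrace below points at the CUDA code path, one untested way to narrow things down would be the same job minus --gpu, run single-threaded (output directory renamed so the GPU job's files are not overwritten; name illustrative):

    # Untested sketch: same refinement without --gpu, to check whether the
    # crash is specific to the CUDA path reported in the backtrace
    `which relion_refine` --nr_parts_sigma2noise 1000 --o Refine3D/test_cpu/run \
      --auto_refine --ios bin2_ribosomes_2D_optimisation_set.star \
      --ref bin2_av_ribo.mrc --firstiter_cc --trust_ref_size --ini_high 40 \
      --dont_combine_weights_via_disc --pool 30 --pad 2 --ctf \
      --particle_diameter 380 --flatten_solvent --zero_mask \
      --solvent_mask mask.mrc --solvent_correct_fsc --oversampling 1 \
      --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 \
      --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 1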

Error message:

This is the output of the job run under valgrind:

==3609768== Memcheck, a memory error detector
==3609768== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3609768== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==3609768== Command: /apps/spack/latest_x86_64/linux-centos8-x86_64/gcc-8.4.1/relion-5.0-beta-ha23c4r27qo4nzfwxwco6zxjkarbgmtp/bin/relion_refine --nr_parts_sigma2noise 1000 --o Refine3D/test1/run --auto_refine --ios bin2_ribosomes_2D_optimisation_set.star --ref bin2_av_ribo.mrc --firstiter_cc --trust_ref_size --ini_high 40 --dont_combine_weights_via_disc --pool 30 --pad 2 --ctf --particle_diameter 380 --flatten_solvent --zero_mask --solvent_mask mask.mrc --solvent_correct_fsc --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 7 --gpu --pipeline_control Refine3D/test1/
==3609768== 
==3609768== Warning: noted but unhandled ioctl 0x30000001 with no size/direction hints.
==3609768==    This could cause spurious value errors to appear.
==3609768==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==3609768== Warning: noted but unhandled ioctl 0x27 with no size/direction hints.
==3609768==    This could cause spurious value errors to appear.
==3609768==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==3609768== Warning: noted but unhandled ioctl 0x25 with no size/direction hints.
==3609768==    This could cause spurious value errors to appear.
==3609768==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==3609768== Warning: noted but unhandled ioctl 0x17 with no size/direction hints.
==3609768==    This could cause spurious value errors to appear.
==3609768==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==3609768== Warning: set address range perms: large range [0x200000000, 0x400200000) (noaccess)
==3609768== Warning: set address range perms: large range [0x1c133000, 0x3c132000) (noaccess)
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_016_s2.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_015_s2.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_018_s2.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_022_s1.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_018_s1.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_017_s2.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_013_s2.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_021_s2.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_022_s2.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_013_s1.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_016_s1.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_003_s2.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_017_s1.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_015_s1.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_020_s1.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_014_s2.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_002_s1.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_014_s1.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_003_s1.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_002_s2.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_001_s2.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_020_s2.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_021_s1.tomostar has relion-4 definition of projection matrices; converting them now... 
 WARNING: tomogram 20200202_riboprot_a010_1_tomo_001_s1.tomostar has relion-4 definition of projection matrices; converting them now... 
==3609768== Warning: noted but unhandled ioctl 0x19 with no size/direction hints.
==3609768==    This could cause spurious value errors to appear.
==3609768==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==3609768== Warning: noted but unhandled ioctl 0x49 with no size/direction hints.
==3609768==    This could cause spurious value errors to appear.
==3609768==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==3609768== Warning: noted but unhandled ioctl 0x21 with no size/direction hints.
==3609768==    This could cause spurious value errors to appear.
==3609768==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==3609768== Warning: noted but unhandled ioctl 0x1b with no size/direction hints.
==3609768==    This could cause spurious value errors to appear.
==3609768==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==3609768== Warning: noted but unhandled ioctl 0x44 with no size/direction hints.
==3609768==    This could cause spurious value errors to appear.
==3609768==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==3609768== Warning: noted but unhandled ioctl 0x48 with no size/direction hints.
==3609768==    This could cause spurious value errors to appear.
==3609768==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==3609768== Warning: set address range perms: large range [0x59e99000, 0x6be98000) (noaccess)
==3609768== Warning: set address range perms: large range [0x6a000000, 0xabfff000) (noaccess)
==3609768== Warning: set address range perms: large range [0x6a000000, 0xaa000000) (noaccess)
==3609768== Warning: set address range perms: large range [0x6e000000, 0x7ffff000) (noaccess)
==3609768== Warning: set address range perms: large range [0x400200000, 0xbae1ff000) (noaccess)
==3609768== Warning: set address range perms: large range [0x1052fbc000, 0x180cfbb000) (noaccess)
==3609768== Thread 5:
==3609768== Invalid read of size 8
==3609768==    at 0x699A70: void getAllSquaredDifferencesCoarse<MlOptimiserCuda>(unsigned int, OptimisationParamters&, SamplingParameters&, MlOptimiser*, MlOptimiserCuda*, AccPtr<float>&, AccPtrFactory, int) (acc_ml_optimiser_impl.h:1151)
==3609768==    by 0x6BB0FD: void accDoExpectationOneParticle<MlOptimiserCuda>(MlOptimiserCuda*, unsigned long, int, AccPtrFactory) (acc_ml_optimiser_impl.h:3838)
==3609768==    by 0x68F1E9: MlOptimiserCuda::doThreadExpectationSomeParticles(int) (cuda_ml_optimiser.cu:284)
==3609768==    by 0x4EF6CC: globalThreadExpectationSomeParticles(void*, int) (ml_optimiser.cpp:84)
==3609768==    by 0x4EF744: MlOptimiser::expectationSomeParticles(long, long) [clone ._omp_fn.0] (ml_optimiser.cpp:4262)
==3609768==    by 0x1702F055: ??? (in /cm/local/apps/gcc/10.2.0/lib64/libgomp.so.1.0.0)
==3609768==    by 0xF3D3149: start_thread (in /usr/lib64/libpthread-2.28.so)
==3609768==    by 0x17567DC2: clone (in /usr/lib64/libc-2.28.so)
==3609768==  Address 0x80 is not stack'd, malloc'd or (recently) free'd
==3609768== 
==3609768== 
==3609768== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==3609768==  Access not within mapped region at address 0x80
==3609768==    at 0x699A70: void getAllSquaredDifferencesCoarse<MlOptimiserCuda>(unsigned int, OptimisationParamters&, SamplingParameters&, MlOptimiser*, MlOptimiserCuda*, AccPtr<float>&, AccPtrFactory, int) (acc_ml_optimiser_impl.h:1151)
==3609768==    by 0x6BB0FD: void accDoExpectationOneParticle<MlOptimiserCuda>(MlOptimiserCuda*, unsigned long, int, AccPtrFactory) (acc_ml_optimiser_impl.h:3838)
==3609768==    by 0x68F1E9: MlOptimiserCuda::doThreadExpectationSomeParticles(int) (cuda_ml_optimiser.cu:284)
==3609768==    by 0x4EF6CC: globalThreadExpectationSomeParticles(void*, int) (ml_optimiser.cpp:84)
==3609768==    by 0x4EF744: MlOptimiser::expectationSomeParticles(long, long) [clone ._omp_fn.0] (ml_optimiser.cpp:4262)
==3609768==    by 0x1702F055: ??? (in /cm/local/apps/gcc/10.2.0/lib64/libgomp.so.1.0.0)
==3609768==    by 0xF3D3149: start_thread (in /usr/lib64/libpthread-2.28.so)
==3609768==    by 0x17567DC2: clone (in /usr/lib64/libc-2.28.so)
==3609768==  If you believe this happened as a result of a stack
==3609768==  overflow in your program's main thread (unlikely but
==3609768==  possible), you can try to increase the size of the
==3609768==  main thread stack using the --main-stacksize= flag.
==3609768==  The main thread stack size used in this run was 16777216.
==3609768== 
==3609768== HEAP SUMMARY:
==3609768==     in use at exit: 1,772,650,840 bytes in 223,650 blocks
==3609768==   total heap usage: 3,894,011 allocs, 3,670,361 frees, 34,875,110,046 bytes allocated
==3609768== 
==3609768== LEAK SUMMARY:
==3609768==    definitely lost: 224 bytes in 3 blocks
==3609768==    indirectly lost: 0 bytes in 0 blocks
==3609768==      possibly lost: 96,400 bytes in 1,140 blocks
==3609768==    still reachable: 1,772,554,216 bytes in 222,507 blocks
==3609768==                       of which reachable via heuristic:
==3609768==                         stdstring          : 4,452 bytes in 28 blocks
==3609768==         suppressed: 0 bytes in 0 blocks
==3609768== Rerun with --leak-check=full to see details of leaked memory
==3609768== 
==3609768== For lists of detected and suppressed errors, rerun with: -s
==3609768== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
/cm/local/apps/slurm/var/spool/job13688254/slurm_script: line 41: 3609768 Segmentation fault      (core dumped) valgrind --track-origins=yes `which relion_refine` --nr_parts_sigma2noise 1000 --o Refine3D/test1/run --auto_refine --ios bin2_ribosomes_2D_optimisation_set.star --ref bin2_av_ribo.mrc --firstiter_cc --trust_ref_size --ini_high 40 --dont_combine_weights_via_disc --pool 30 --pad 2 --ctf --particle_diameter 380 --flatten_solvent --zero_mask --solvent_mask mask.mrc --solvent_correct_fsc --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 7 --gpu --pipeline_control Refine3D/test1/
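For what it's worth, the faulting address 0x80 looks like a member access through a null object pointer (0x80 would be the member's offset within its class) rather than stack or heap corruption. A typical follow-up on the dumped core would be something along these lines (core file name illustrative):

    # Inspect the core at the faulting frame (acc_ml_optimiser_impl.h:1151)
    gdb `which relion_refine` core.3609768
    (gdb) bt              # confirm the same stack valgrind reports
    (gdb) frame 0         # getAllSquaredDifferencesCoarse
    (gdb) info locals     # look for the null pointer behind address 0x80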

Let me know if there is any additional information you need about the job and/or the data.
Do you have any suggestions on what the issue could be? Thanks!

@xinsheng44

I ran into the same situation.
