Broken streaming of vector of enum with underlying type other than int #16312

Open
1 task done
ktf opened this issue Aug 26, 2024 · 23 comments · May be fixed by #17009
Assignees
Labels
bug experiment Affects an experiment / reported by its software & computing experts in:I/O

Comments

@ktf
Contributor

ktf commented Aug 26, 2024

Check duplicate issues.

  • Checked for duplicates

Description

I need help understanding an issue we see on Linux on ARM when reading a file that was serialised on x86. Note that this platform is peculiar in that plain char (without a signedness specifier) is unsigned rather than signed (the signedness of char is implementation-defined in the standard).

This is important because mPadSubset, which you will see below, is an enum PadSubset : char. Running under valgrind, the issue appears as dumped below.

What puzzles me, and what I think causes the segmentation fault, is the line:

[1965517:tpc-tracker]:    i= 2, mPadSubset      type= 23, offset= 56, len=2, method=0 [optimized]

as I would have expected it to be len=1. Can you explain to me what is going on?

[1965517:tpc-tracker]: ====>Rebuilding TStreamerInfo for class: o2::tpc::CalDet<o2::tpc::PadFlags>, version: 1
[1965517:tpc-tracker]: Creating StreamerInfo for class: o2::tpc::CalDet<o2::tpc::PadFlags>, version: 2
[1965517:tpc-tracker]:
[1965517:tpc-tracker]: StreamerInfo for class: o2::tpc::CalDet<o2::tpc::PadFlags>, version=2, checksum=0x93700773
[1965517:tpc-tracker]:   string         mName           offset=  0 type=300 ,stl=365, ctype=365, name of the object
[1965517:tpc-tracker]:   vector<o2::tpc::CalArray<o2::tpc::PadFlags> > mData           offset= 32 type=300 ,stl=1, ctype=61, internal CalArrays
[1965517:tpc-tracker]:   o2::tpc::PadSubset mPadSubset      offset= 56 type= 3 Pad subset granularity
[1965517:tpc-tracker]:    i= 0, mName           type=300, offset=  0, len=1, method=0
[1965517:tpc-tracker]:    i= 1, mData           type=300, offset= 32, len=1, method=0
[1965517:tpc-tracker]:    i= 2, mPadSubset      type=  3, offset= 56, len=1, method=0
[1965517:tpc-tracker]:
[1965517:tpc-tracker]: StreamerInfo for class: o2::tpc::CalDet<o2::tpc::PadFlags>, version=1, checksum=0x93700773
[1965517:tpc-tracker]:   string         mName           offset=  0 type=300 ,stl=365, ctype=365, name of the object
[1965517:tpc-tracker]:   vector<o2::tpc::CalArray<o2::tpc::PadFlags> > mData           offset= 32 type=300 ,stl=1, ctype=61, internal CalArrays
[1965517:tpc-tracker]:   o2::tpc::PadSubset mPadSubset      offset= 56 type= 3 Pad subset granularity
[1965517:tpc-tracker]:    i= 0, mName           type=300, offset=  0, len=1, method=0
[1965517:tpc-tracker]:    i= 1, mData           type=300, offset= 32, len=1, method=0
[1965517:tpc-tracker]:    i= 2, mPadSubset      type=  3, offset= 56, len=1, method=0
[1965517:tpc-tracker]:
[1965517:tpc-tracker]: ====>Rebuilding TStreamerInfo for class: o2::tpc::CalArray<o2::tpc::PadFlags>, version: 1
[1965517:tpc-tracker]:
[1965517:tpc-tracker]: StreamerInfo for class: o2::tpc::CalArray<o2::tpc::PadFlags>, version=1, checksum=0xb03d18c2
[1965517:tpc-tracker]:   string         mName           offset=  0 type=300 ,stl=365, ctype=365,
[1965517:tpc-tracker]:   vector<o2::tpc::PadFlags> mData           offset= 32 type=300 ,stl=1, ctype=3, calibration data
[1965517:tpc-tracker]:   o2::tpc::PadSubset mPadSubset      offset= 56 type= 3 Subset type
[1965517:tpc-tracker]:   int            mPadSubsetNumber offset= 60 type= 3 Number of the pad subset, e.g. ROC 0 is IROC A00
[1965517:tpc-tracker]:    i= 0, mName           type=300, offset=  0, len=1, method=0
[1965517:tpc-tracker]:    i= 1, mData           type=300, offset= 32, len=1, method=0
[1965517:tpc-tracker]:    i= 2, mPadSubset      type= 23, offset= 56, len=2, method=0 [optimized]
[1965517:tpc-tracker]: ==1965517== Invalid write of size 1
[1965517:tpc-tracker]: ==1965517==    at 0xF36E7A0: frombuf (Bytes.h:313)
[1965517:tpc-tracker]: ==1965517==    by 0xF36E7A0: frombuf (Bytes.h:442)
[1965517:tpc-tracker]: ==1965517==    by 0xF36E7A0: ReadFastArray (TBufferFile.cxx:1338)
[1965517:tpc-tracker]: ==1965517==    by 0xF36E7A0: TBufferFile::ReadFastArray(int*, int) (TBufferFile.cxx:1327)
[1965517:tpc-tracker]: ==1965517==    by 0xF3E580B: void TGenCollectionStreamer::ReadBufferVectorPrimitives<int>(TBuffer&, void*, TClass const*) (TGenCollectionStreamer.cxx:1183)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: Streamer (TClass.h:614)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: TBufferFile::ReadFastArray(void*, TClass const*, int, TMemberStreamer*, TClass const*) (TBufferFile.cxx:1616)
[1965517:tpc-tracker]: ==1965517==    by 0xF58C84B: int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) (TStreamerInfoReadBuffer.cxx:1297)
[1965517:tpc-tracker]: ==1965517==    by 0xF45B81F: TStreamerInfoActions::VectorLooper::GenericRead(TBuffer&, void*, void const*, TStreamerInfoActions::TLoopConfiguration const*, TStreamerInfoActions::TConfiguration const*) (TStreamerInfoActions.cxx:1883)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DAAB: operator() (TStreamerInfoActions.h:131)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DAAB: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*, void*) (TBufferFile.cxx:3736)
[1965517:tpc-tracker]: ==1965517==    by 0xF482A0F: TStreamerInfoActions::ReadSTLMemberWiseSameClass(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*, short) (TStreamerInfoActions.cxx:1155)
[1965517:tpc-tracker]: ==1965517==    by 0xF482C4F: int TStreamerInfoActions::ReadSTL<&TStreamerInfoActions::ReadSTLMemberWiseSameClass, &TStreamerInfoActions::ReadSTLObjectWiseFastArray>(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) (TStreamerInfoActions.cxx:1405)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: operator() (TStreamerInfoActions.h:123)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: ApplySequence (TBufferFile.cxx:3670)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) (TBufferFile.cxx:3661)
[1965517:tpc-tracker]: ==1965517==    by 0xF376CEB: TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*) (TBufferFile.cxx:3598)
[1965517:tpc-tracker]: ==1965517==    by 0xF3F4633: Streamer (TClass.h:614)
[1965517:tpc-tracker]: ==1965517==    by 0xF3F4633: TKey::ReadObjectAny(TClass const*) (TKey.cxx:1120)
[1965517:tpc-tracker]: ==1965517==    by 0xF3B82E3: TDirectoryFile::GetObjectChecked(char const*, TClass const*) (TDirectoryFile.cxx:1111)
[1965517:tpc-tracker]: ==1965517==  Address 0x153fbb80 is 0 bytes after a block of size 1,440 alloc'd
[1965517:tpc-tracker]: ==1965517==    at 0x4868908: operator new(unsigned long) (vg_replace_malloc.c:483)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: allocate (new_allocator.h:137)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: allocate (allocator.h:188)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: allocate (alloc_traits.h:464)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: _M_allocate (stl_vector.h:378)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: _M_allocate (stl_vector.h:375)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: std::vector<o2::tpc::PadFlags, std::allocator<o2::tpc::PadFlags> >::_M_default_append(unsigned long) (vector.tcc:650)
[1965517:tpc-tracker]: ==1965517==    by 0xF3E5797: void TGenCollectionStreamer::ReadBufferVectorPrimitives<int>(TBuffer&, void*, TClass const*) (TGenCollectionStreamer.cxx:1176)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: Streamer (TClass.h:614)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: TBufferFile::ReadFastArray(void*, TClass const*, int, TMemberStreamer*, TClass const*) (TBufferFile.cxx:1616)
[1965517:tpc-tracker]: ==1965517==    by 0xF58C84B: int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) (TStreamerInfoReadBuffer.cxx:1297)
[1965517:tpc-tracker]: ==1965517==    by 0xF45B81F: TStreamerInfoActions::VectorLooper::GenericRead(TBuffer&, void*, void const*, TStreamerInfoActions::TLoopConfiguration const*, TStreamerInfoActions::TConfiguration const*) (TStreamerInfoActions.cxx:1883)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DAAB: operator() (TStreamerInfoActions.h:131)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DAAB: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*, void*) (TBufferFile.cxx:3736)
[1965517:tpc-tracker]: ==1965517==    by 0xF482A0F: TStreamerInfoActions::ReadSTLMemberWiseSameClass(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*, short) (TStreamerInfoActions.cxx:1155)
[1965517:tpc-tracker]: ==1965517==    by 0xF482C4F: int TStreamerInfoActions::ReadSTL<&TStreamerInfoActions::ReadSTLMemberWiseSameClass, &TStreamerInfoActions::ReadSTLObjectWiseFastArray>(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) (TStreamerInfoActions.cxx:1405)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: operator() (TStreamerInfoActions.h:123)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: ApplySequence (TBufferFile.cxx:3670)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) (TBufferFile.cxx:3661)
[1965517:tpc-tracker]: ==1965517==    by 0xF376CEB: TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*) (TBufferFile.cxx:3598)
[1965517:tpc-tracker]: ==1965517==    by 0xF3F4633: Streamer (TClass.h:614)
[1965517:tpc-tracker]: ==1965517==    by 0xF3F4633: TKey::ReadObjectAny(TClass const*) (TKey.cxx:1120)
[1965517:tpc-tracker]: ==1965517==
[1965517:tpc-tracker]: ==1965517== Invalid write of size 1
[1965517:tpc-tracker]: ==1965517==    at 0xF36E7AC: frombuf (Bytes.h:314)
[1965517:tpc-tracker]: ==1965517==    by 0xF36E7AC: frombuf (Bytes.h:442)
[1965517:tpc-tracker]: ==1965517==    by 0xF36E7AC: ReadFastArray (TBufferFile.cxx:1338)
[1965517:tpc-tracker]: ==1965517==    by 0xF36E7AC: TBufferFile::ReadFastArray(int*, int) (TBufferFile.cxx:1327)
[1965517:tpc-tracker]: ==1965517==    by 0xF3E580B: void TGenCollectionStreamer::ReadBufferVectorPrimitives<int>(TBuffer&, void*, TClass const*) (TGenCollectionStreamer.cxx:1183)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: Streamer (TClass.h:614)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: TBufferFile::ReadFastArray(void*, TClass const*, int, TMemberStreamer*, TClass const*) (TBufferFile.cxx:1616)
[1965517:tpc-tracker]: ==1965517==    by 0xF58C84B: int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) (TStreamerInfoReadBuffer.cxx:1297)
[1965517:tpc-tracker]: ==1965517==    by 0xF45B81F: TStreamerInfoActions::VectorLooper::GenericRead(TBuffer&, void*, void const*, TStreamerInfoActions::TLoopConfiguration const*, TStreamerInfoActions::TConfiguration const*) (TStreamerInfoActions.cxx:1883)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DAAB: operator() (TStreamerInfoActions.h:131)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DAAB: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*, void*) (TBufferFile.cxx:3736)
[1965517:tpc-tracker]: ==1965517==    by 0xF482A0F: TStreamerInfoActions::ReadSTLMemberWiseSameClass(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*, short) (TStreamerInfoActions.cxx:1155)
[1965517:tpc-tracker]: ==1965517==    by 0xF482C4F: int TStreamerInfoActions::ReadSTL<&TStreamerInfoActions::ReadSTLMemberWiseSameClass, &TStreamerInfoActions::ReadSTLObjectWiseFastArray>(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) (TStreamerInfoActions.cxx:1405)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: operator() (TStreamerInfoActions.h:123)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: ApplySequence (TBufferFile.cxx:3670)
[1965517:tpc-tracker]: ==1965517==    by 0xF36DE4B: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) (TBufferFile.cxx:3661)
[1965517:tpc-tracker]: ==1965517==    by 0xF376CEB: TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*) (TBufferFile.cxx:3598)
[1965517:tpc-tracker]: ==1965517==    by 0xF3F4633: Streamer (TClass.h:614)
[1965517:tpc-tracker]: ==1965517==    by 0xF3F4633: TKey::ReadObjectAny(TClass const*) (TKey.cxx:1120)
[1965517:tpc-tracker]: ==1965517==    by 0xF3B82E3: TDirectoryFile::GetObjectChecked(char const*, TClass const*) (TDirectoryFile.cxx:1111)
[1965517:tpc-tracker]: ==1965517==  Address 0x153fbb81 is 1 bytes after a block of size 1,440 alloc'd
[1965517:tpc-tracker]: ==1965517==    at 0x4868908: operator new(unsigned long) (vg_replace_malloc.c:483)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: allocate (new_allocator.h:137)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: allocate (allocator.h:188)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: allocate (alloc_traits.h:464)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: _M_allocate (stl_vector.h:378)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: _M_allocate (stl_vector.h:375)
[1965517:tpc-tracker]: ==1965517==    by 0x60E5D1F: std::vector<o2::tpc::PadFlags, std::allocator<o2::tpc::PadFlags> >::_M_default_append(unsigned long) (vector.tcc:650)
[1965517:tpc-tracker]: ==1965517==    by 0xF3E5797: void TGenCollectionStreamer::ReadBufferVectorPrimitives<int>(TBuffer&, void*, TClass const*) (TGenCollectionStreamer.cxx:1176)
[1965517:tpc-tracker]: ==1965517==    by 0xF36EC7B: Streamer (TClass.h:614)

Reproducer

I do not have one which does not involve running ALICE reconstruction on ARM.

ROOT version

6.32.02.

Installation method

aliBuild

Operating system

ALMA Linux 9 on ARM64 (Ampere Altra)

Additional context

No response

@jblomer
Contributor

jblomer commented Aug 27, 2024

Can you give us a bit more information? What would be useful, if possible:

  • The stacktrace from the segfault
  • A description on how to set up the corresponding ALICE environment so that we can look at the dictionaries and headers
  • The ROOT file that caused the crash

Is it confirmed that the same data serialized on ARM does not cause a crash?

@ktf
Contributor Author

ktf commented Aug 27, 2024

For the file:

https://cernbox.cern.ch/s/MXkLwJLm61rckhj

I cannot confirm if the same data serialised on ARM does not cause a crash.

@ktf
Contributor Author

ktf commented Aug 27, 2024

[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so: handle_crash(int)
[1064949:tpc-tracker]:     linux-vdso.so.1:     ?? ??:0
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TBufferFile::ReadFastArray(int*, int)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: void TGenCollectionStreamer::ReadBufferVectorPrimitives<int>(TBuffer&, void*, TClass const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TBufferFile::ReadFastArray(void*, TClass const*, int, TMemberStreamer*, TClass const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TStreamerInfoActions::VectorLooper::GenericRead(TBuffer&, void*, void const*, TStreamerInfoActions::TLoopConfiguration const*, TStreamerInfoActions::TConfiguration const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*, void*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TStreamerInfoActions::ReadSTLMemberWiseSameClass(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*, short)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: int TStreamerInfoActions::ReadSTL<&TStreamerInfoActions::ReadSTLMemberWiseSameClass, &TStreamerInfoActions::ReadSTLObjectWiseFastArray>(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TKey::ReadObjectAny(TClass const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TDirectoryFile::GetObjectChecked(char const*, TClass const*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so: o2::framework::DataRefUtils::decodeCCDB(o2::framework::DataRef const&, std::type_info const&)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2GPUWorkflow.so: decltype(auto) o2::framework::InputRecord::get<o2::tpc::CalDet<o2::tpc::PadFlags>*, char const*>(char const*, int) const
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2GPUWorkflow.so: bool o2::gpu::GPURecoWorkflowSpec::fetchCalibsCCDBTPC<o2::gpu::GPUCalibObjectsTemplate<o2::gpu::ConstPtr> >(o2::framework::ProcessingContext&, o2::gpu::GPUCalibObjectsTemplate<o2::gpu::ConstPtr>&, o2::gpu::GPURecoWorkflowSpec::calibObjectStruct&)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2GPUWorkflow.so: o2::gpu::GPURecoWorkflowSpec::doCalibUpdates(o2::framework::ProcessingContext&, o2::gpu::GPURecoWorkflowSpec::calibObjectStruct&)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2GPUWorkflow.so: o2::gpu::GPURecoWorkflowSpec::run(o2::framework::ProcessingContext&)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so:     ?? ??:0
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so: o2::framework::DataProcessingDevice::tryDispatchComputation(o2::framework::ServiceRegistryRef, std::vector<o2::framework::DataRelayer::RecordAction, std::allocator<o2::framework::DataRelayer::RecordAction> >&)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so: o2::framework::DataProcessingDevice::doRun(o2::framework::ServiceRegistryRef)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so: o2::framework::run_callback(uv_work_s*)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2Framework.so: o2::framework::DataProcessingDevice::Run()
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/FairMQ/v1.8.4-2/lib/libfairmq.so.1.8.4: fair::mq::Device::RunWrapper()
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/FairMQ/v1.8.4-2/lib/libfairmq.so.1.8.4: boost::detail::function::void_function_obj_invoker1<std::function<void (fair::mq::State)>, void, fair::mq::State>::invoke(boost::detail::function::function_buffer&, fair::mq::State)
[1064949:tpc-tracker]:     /root/src/sw/slc9_aarch64/FairMQ/v1.8.4-2/lib/libfairmq.so.1.8.4: boost::signals2::detail::signal_impl<void (fair::mq::State), boost::signals2::optional_last_value<void>, int, std::less<int>, boost::function<void (fair::mq::State)>, boost::function<void (boost::signals2::connection const&, fair::mq::State)>, boost::signals2::mutex>::operator()(fair::mq::State)

is one of the stacktraces. It actually dies in different ways; most likely there is some memory corruption going on...

@ktf
Contributor Author

ktf commented Aug 27, 2024

For the ALICE environment, the easiest is probably sitting together. It's on a custom machine in my private area.

@jblomer
Contributor

jblomer commented Aug 27, 2024

Thanks. I'm not at CERN today, but I'm getting started with the information.

@jblomer
Contributor

jblomer commented Aug 27, 2024

(Side note: MakeProject does not reconstruct the enums with the correct underlying type)

@ktf
Contributor Author

ktf commented Aug 27, 2024

Another stacktrace which seems to be related to this is:

[1500611:internal-dpl-ccdb-backend]: Executable is /root/src/sw/slc9_aarch64/O2/dev-local1/bin/o2-tpc-reco-workflow
[1500611:internal-dpl-ccdb-backend]:     linux-vdso.so.1:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     [0xfff3cae9b014]:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     [0xfff3cae9d7f0]:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so:     ?? ??:0
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so: TCling::AutoParseImplRecurse(char const*, bool)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so: TCling::AutoParse(char const*)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so: TClingLookupHelper__AutoParse(char const*)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCling.so: ROOT::TMetaUtils::TClingLookupHelper::GetPartiallyDesugaredNameWithScopeHandling(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, bool)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCore.so.6.32: TClassEdit::GetNormalizedName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::basic_string_view<char, std::char_traits<char> >)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libCore.so.6.32: TClass::GetClass(char const*, bool, bool, unsigned long, unsigned long)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TStreamerInfo::BuildCheck(TFile*, bool)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TFile::ReadStreamerInfo()
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TFile::Init(bool)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/ROOT/v6-32-02-alice1-1/lib/libRIO.so.6.32: TMemFile::TMemFile(char const*, char*, long long, char const*, char const*, int, long long)
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2CCDB.so: o2::ccdb::CcdbApi::loadFileToMemory(std::vector<char, boost::container::pmr::polymorphic_allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >*) const
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2CCDB.so: o2::ccdb::CcdbApi::getFromSnapshot(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::vector<char, boost::container::pmr::polymorphic_allocator<char> >&, int&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2CCDB.so: o2::ccdb::CcdbApi::navigateSourcesAndLoadFile(o2::ccdb::CcdbApi::RequestContext&, int&, unsigned long*) const
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2CCDB.so: o2::ccdb::CcdbApi::vectoredLoadFileToMemory(std::vector<o2::ccdb::CcdbApi::RequestContext, std::allocator<o2::ccdb::CcdbApi::RequestContext> >&) const
[1500611:internal-dpl-ccdb-backend]:     /root/src/sw/slc9_aarch64/O2/dev-local1/lib/libO2CCDB.so: o2::ccdb::CcdbApi::loadFileToMemory(std::vector<char, boost::container::pmr::polymorphic_allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::bas$

@ktf
Contributor Author

ktf commented Aug 27, 2024

Interestingly enough, the actual array returned by backtrace can be decoded by GDB to:

$4 = {0xffffac196fb0 <handle_crash(int)+48>, 0xffffb2f727f0 <__kernel_rt_sigreturn>, 0xfff3ea6f5014, 0xfff3ea6f77f0,
  0xffff9e97b198 <(anonymous namespace)::GenericLLVMIRPlatformSupport::initialize(llvm::orc::JITDylib&)+2392>,
  0xffff9d4b0de0 <cling::IncrementalExecutor::runStaticInitializersOnce(cling::Transaction&)+272>, 0xffff9d435f78 <cling::Interpreter::executeTransaction(cling::Transaction&)+40>,
  0xffff9d4c0e30 <cling::IncrementalParser::commitTransaction(llvm::PointerIntPair<cling::Transaction*, 2u, cling::IncrementalParser::EParseResult, llvm::PointerLikeTypeTraits<cling::Transaction*>, llvm::PointerIntPairInfo<cling::Transaction*, 2u, llvm::PointerLikeTypeTraits<cling::Transaction*> > >&, bool)+768>,
  0xffff9d4c398c <cling::IncrementalParser::Compile(llvm::StringRef, cling::CompilationOptions const&)+108>,
  0xffff9d433d80 <cling::Interpreter::parseForModule(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+176>, 0xffff9d36b5f8
     <ExecAutoParse(char const*, Bool_t, cling::Interpreter*)+568>, 0xffff9d36cf48 <TCling::AutoParseImplRecurse(char const*, bool)+1400>, 0xffff9d374de4 <TCling::AutoParse(char const*)+340>,
  0xffff9d355204 <TClingLookupHelper__AutoParse(char const*)+36>, 0xffff9d2c8b44
     <ROOT::TMetaUtils::TClingLookupHelper::GetPartiallyDesugaredNameWithScopeHandling(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, bool)+116>, 0xffffa7acf42c
     <TClassEdit::GetNormalizedName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::basic_string_view<char, std::char_traits<char> >)+540>, 0xffffa7aeab58
     <TClass::GetClass(char const*, bool, bool, unsigned long, unsigned long)+1144>, 0xffffa7f852b4 <TStreamerInfo::BuildCheck(TFile*, bool)+148>, 0xffffa7f4751c <TFile::ReadStreamerInfo()+700>,
  0xffffa7f4fc40 <TFile::Init(bool)+1056>, 0xffffa7f74a60 <TMemFile::TMemFile(char const*, char*, long long, char const*, char const*, int, long long)+268>, 0xffffac4515b4
     <o2::ccdb::CcdbApi::loadFileToMemory(std::vector<char, boost::container::pmr::polymorphic_allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >*) const+900>,
  0xffffac451f68 <o2::ccdb::CcdbApi::getFromSnapshot(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::vector<char, boost::container::pmr::polymorphic_allocator<char> >&, int&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+936>,
  0xffffac452100 <o2::ccdb::CcdbApi::navigateSourcesAndLoadFile(o2::ccdb::CcdbApi::RequestContext&, int&, unsigned long*) const+192>,
  0xffffac4524d0 <o2::ccdb::CcdbApi::vectoredLoadFileToMemory(std::vector<o2::ccdb::CcdbApi::RequestContext, std::allocator<o2::ccdb::CcdbApi::RequestContext> >&) const+240>,

@jblomer jblomer self-assigned this Aug 28, 2024
@jblomer
Contributor

jblomer commented Aug 28, 2024

Some more points gathered during a debug session:

  • The problem appears only on ARM/Linux, not on ARM/Mac
  • The streamer info output
    [1965517:tpc-tracker]:    i= 2, mPadSubset      type= 23, offset= 56, len=2, method=0 [optimized]
    does not seem to indicate a problem because the same list of streamer elements also contains the expected
    o2::tpc::PadSubset mPadSubset      offset= 56 type= 3 Subset type
  • If the class o2::tpc::CalArray<o2::tpc::PadFlags> is added to the dictionaries (Linkdef), the stacktrace changes and the crash becomes reproducible. In this case, there is an error writing beyond vector boundaries.
  • The next step is to try to reproduce the crash with a debug build of ROOT

@jblomer
Contributor

jblomer commented Aug 29, 2024

Further debugging revealed a deeper issue that seems to surface on ARM/Linux only by chance:

Writing or reading a vector of enums goes through the collection proxy. The collection proxy uses WriteFastArray / ReadFastArray with kInt_t, ignoring the actual underlying type of the enum. At some point in the read/write chain, this causes memory reads/writes beyond the bounds of the array.

@jblomer
Contributor

jblomer commented Aug 29, 2024

I think the cause is https://github.com/root-project/root/blob/master/io/io/src/TGenCollectionProxy.cxx#L404 (and similar lines further down), which hard-code the enum underlying type to int.

When fixing, I think we need to take care of what happens to files already written out with the wrong enum width.

@ktf
Contributor Author

ktf commented Aug 30, 2024

Do I understand correctly this affects only scoped enums within a vector? Can I simply fix it on my side by moving to enum class Foo : int {}?

@jblomer
Contributor

jblomer commented Aug 30, 2024

Although: I'm not exactly sure if already existing files that were serialized with a shorter enum correctly read back. I think yes, but that needs to be tested.

@ktf
Contributor Author

ktf commented Aug 30, 2024

Although: I'm not exactly sure if already existing files that were serialized with a shorter enum correctly read back. I think yes, but that needs to be tested.

This I can try on my side.

@jblomer
Contributor

jblomer commented Aug 30, 2024

I'm attaching a minimal reproducer.

minimalTestVectorOfEnums.tar.gz

This test returns (wrongly)

Size of PadFlags: 2
Enum underlying type: 12
mFlags size before writing: 2
mFlags size after reading: 4
0 0 23824 0

With a patch to TGenCollectionProxy::Value, the result is correct:

Size of PadFlags: 2
Enum underlying type: 12
mFlags size before writing: 2
mFlags size after reading: 2
0 0

I think the next steps should be discussed with @pcanal. In particular:

  • What about the cases when we only have an emulated enum? With this patch in place, we cannot just assume anymore that this will be an int on disk.
  • In general, how do we correctly handle vectors of enums with underlying types different than int that are on disk, before and after the patch?

@jblomer jblomer changed the title Root unable to read on ARM a class serialised with x86-64 Broken streaming of vector of enum with underlying type different from int Aug 30, 2024
@jblomer jblomer changed the title Broken streaming of vector of enum with underlying type different from int Broken streaming of vector of enum with underlying type other than int Aug 30, 2024
@jblomer jblomer added the experiment Affects an experiment / reported by its software & computimng experts label Aug 30, 2024
@jblomer
Contributor

jblomer commented Aug 30, 2024

AFAICT, neither TTree nor RNTuple I/O are affected by this issue.

ktf added a commit to ktf/AliceO2 that referenced this issue Aug 30, 2024
ROOT has an issue when serializing std::vector<T> where T is a scoped
enum backed by something which has size different from int one.

This is true for any architecture, it just happens to be more lethal on
ARM. For more details root-project/root#16312.

Add explicitly types to the linkdef while at it.
shahor02 pushed a commit to shahor02/AliceO2 that referenced this issue Aug 31, 2024
@pcanal
Member

pcanal commented Sep 5, 2024

[1965517:tpc-tracker]: i= 2, mPadSubset type= 23, offset= 56, len=2, method=0 [optimized]
as I would have expected it to be len=1. Can you explain me what is going on?

If the next data member (which should not be listed right after it) is of the same type, TStreamerInfo will collate them (note the optimized part).

@pcanal
Member

pcanal commented Sep 5, 2024

We should be able to fix the usage in regular I/O and TTree (which is also broken) when using a dictionary. Proper support in bare ROOT might be harder: the underlying size information is a bit harder to find and in some cases (top-level vector of enums) might not (yet?) be available.

@pcanal
Member

pcanal commented Sep 5, 2024

In general, how do we correctly handle vectors of enums with underlying types different than int that are on disk, before and after the patch?

With dictionaries, it seems to work fine (for embedded vectors, probably not for standalone vectors) because the TStreamerInfo of the containing class records the underlying type and thus knows when a conversion is needed. The corollary is that a class version number must be updated (to allow schema evolution) if one of the enum types it uses changes its underlying type.

@ktf
Contributor Author

ktf commented Sep 6, 2024

For the record, as you might have seen in AliceO2Group/AliceO2#13464, simply changing the types breaks reading back old files (i.e. two shorts are read into an int). Could you comment on when you expect to have a fix for this on your side that applies to 6.32.2, and whether it will allow old code to still read new data (and vice versa, new code / old data)?

@pcanal
Member

pcanal commented Sep 6, 2024

Side note for the record: the original valgrind report and crash happen in the case where the vector<EnumType> is itself held in a vector (of CalArray) held in an object (CalDet).

I have a workaround that solves the problem for the case in the minimal reproducer; it revolves around setting a read rule for the vector of enums:

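// Read-rule helper: in the minimal reproducer the vector read back from file
// holds more entries than were written; keep only the leading, genuine ones.
template <typename E>
void LoadEnumCollection(/* const */ std::vector<E> &onfile, std::vector<E> &enums)
{
   constexpr size_t delta = sizeof(int)/sizeof(E);  // ratio of int width to enum width
   const size_t nvalues = onfile.size() / delta;    // number of elements originally written
   onfile.resize(nvalues);                          // drop the surplus trailing entries
   std::swap(onfile, enums);                        // hand the result to the in-memory member
}
#pragma read sourceClass="Event" checksums="[0xa2558fd6]" targetClass="Event" source="std::vector<PadFlags> mFlags" target="mFlags" code="{ LoadEnumCollection(onfile.mFlags, mFlags); }"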

However, it does not work yet for the actual/original problem :(. (In the minimal reproducer the size of the container is double what it should be, but there is no overwrite/crash, while in the original the container ends up with the right size but with an overwrite and thus a crash.)

@pcanal
Member

pcanal commented Sep 6, 2024

The following custom Streamer works around the issue:

template <typename Flags>
inline void CalArray<Flags>::Streamer(TBuffer &R__b)
{
   // Stream an object of class CalArray<PadFlags>.

   if (R__b.IsReading()) {
      UInt_t R__s, R__c;
      Version_t R__v = R__b.ReadVersion(&R__s, &R__c);
      if (R__v <= 3) {
         {
            UInt_t start, count;
            Version_t vers = R__b.ReadVersion(&start, &count);

            std::vector<int> R__stl;
            R__stl.clear();
            int R__n;
            R__b >> R__n;
            R__stl.reserve(R__n);
            for (int R__i = 0; R__i < R__n; R__i++) {
               Int_t readtemp;
               R__b >> readtemp;
               R__stl.push_back(readtemp);
            }
            R__b.CheckByteCount(start, count, "stl collection of enums");

            mFlags.clear();
            auto data = reinterpret_cast<unsigned short*>(R__stl.data());
            constexpr size_t delta = sizeof(int)/sizeof(Flags);
            for(int i = 0; i < R__n; ++i)
               mFlags.push_back(static_cast<PadFlags>( data[i] ));
         }
         int tmp;
         R__b >> tmp;
         mPadSubset = static_cast<PadSubset>(tmp);

         R__b.CheckByteCount(R__s, R__c, CalArray::IsA());
      } else {
         R__b.ReadClassBuffer(CalArray<Flags>::Class(),this, R__v, R__s, R__c);
      }
   } else {
      R__b.WriteClassBuffer(CalArray<Flags>::Class(),this);
   }
}

[Call to ReadClassBuffer was corrected to add missing parameters]

@ktf
Contributor Author

ktf commented Oct 31, 2024

Any follow-up on the bug itself? Will we have a fix in ROOT which avoids a custom streamer?

@pcanal pcanal linked a pull request Nov 22, 2024 that will close this issue