mpl2 recurses itself to a seg fault on BoomFrontend #6083

Open
jeffng-or opened this issue Nov 1, 2024 · 8 comments
Labels: mpl (Macro Placement), par (Partitioning)

Comments

@jeffng-or
Contributor

Describe the bug

The macro placer runs for 14h before seg faulting on BoomFrontend, which is a sub-module of BoomTile. Note that the segfault isn't seen in the full BoomTile run, which runs for about 1h.

I've re-run the job in GDB and mpl2 is infinitely recursing itself into oblivion. Here's a snippet of the stack trace:

 
Thread 1 "openroad" received signal SIGSEGV, Segmentation fault.
0x00007ffff3294c5c in __pthread_create_2_1 (newthread=0x556604b97050, attr=0x0, start_routine=0x7ffff36dc240, arg=0x5566029a42b0) at ./nptl/pthread_create.c:621
621	./nptl/pthread_create.c: No such file or directory.
(gdb) where
#0  0x00007ffff3294c5c in __pthread_create_2_1 (newthread=0x556604b97050, 
    attr=0x0, start_routine=0x7ffff36dc240, arg=0x5566029a42b0)
    at ./nptl/pthread_create.c:621
#1  0x00007ffff36dc329 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x0000555558597310 in par::KWayFMRefine::InitializeGainBucketsKWay(std::vector<std::shared_ptr<par::PriorityQueue>, std::allocator<std::shared_ptr<par::PriorityQueue> > >&, std::shared_ptr<par::Hypergraph> const&, std::vector<int, std::allocator<int> > const&, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, std::vector<float, std::allocator<float> > const&, std::vector<int, std::allocator<int> > const&) const ()
#3  0x00005555585988ff in par::KWayFMRefine::Pass(std::shared_ptr<par::Hypergraph> const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > >&, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >&, std::vector<float, std::allocator<float> >&, std::vector<int, std::allocator<int> >&, std::vector<bool, std::allocator<bool> >&) ()
#4  0x0000555558583528 in par::Refiner::Refine(std::shared_ptr<par::Hypergraph> const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<int, std::allocator<int> >&) ()
#5  0x00005555585775aa in par::MultilevelPartitioner::InitialPartition(std::shared_ptr<par::Hypergraph> const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >&, int&) const ()
#6  0x000055555857a144 in par::MultilevelPartitioner::SingleLevelPartition(std::shared_ptr<par::Hypergraph> const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&) const ()
#7  0x000055555857a7d4 in par::MultilevelPartitioner::Partition(std::shared_ptr<par::Hypergraph> const&, std::vector<std::vector<float, std::allocator<float> >,--Type <RET> for more, q to quit, c to continue without paging--
#8  0x000055555854860f in par::TritonPart::MultiLevelPartition() ()
#9  0x000055555854a339 in par::TritonPart::PartitionKWaySimpleMode(unsigned int, float, unsigned int, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&) ()
#10 0x000055555853f6e6 in par::PartitionMgr::PartitionKWaySimpleMode(unsigned int, float, unsigned int, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&) ()
#11 0x00005555581ba37c in mpl2::ClusteringEngine::breakLargeFlatCluster(mpl2::Cluster*) ()
#12 0x00005555581ba62b in mpl2::ClusteringEngine::breakLargeFlatCluster(mpl2::Cluster*) ()
#13 0x00005555581ba63a in mpl2::ClusteringEngine::breakLargeFlatCluster(mpl2::Cluster*) ()
...
#11113 0x00005555581bba5b in mpl2::ClusteringEngine::updateSubTree(mpl2::Cluster*) ()
#11114 0x00005555581c45b5 in mpl2::ClusteringEngine::multilevelAutocluster(mpl2::Cluster*) ()
#11115 0x00005555581c4ee7 in mpl2::ClusteringEngine::run() ()
#11116 0x0000555558145a48 in mpl2::HierRTLMP::runMultilevelAutoclustering() ()

The full-ish stack trace can be found at: https://drive.google.com/file/d/10MMydy8f761RPeXXE5FKgIVFDAtlWwCn/view?usp=sharing

The tarball can be found at: https://drive.google.com/file/d/1PH8jZAREhRn4NIVryR7pes3sKNGSIBqs/view?usp=sharing

Expected Behavior

A successful mpl2 run that finishes in under 1h without a seg fault.

Environment

commit defc349ec719f45e115b85e317e95db86769e439 (HEAD -> master, origin/master, origin/HEAD)
Merge: e30f8fc8 8c3afb17
Author: Matt Liberty <[email protected]>
Date:   Mon Oct 28 18:28:51 2024 -0700

    Merge pull request #2520 from Pinata-Consulting/makefile-do-floorplan-fix-2
    
    makefile: fix one more do-floorplan gaffe

To Reproduce

  1. unpack the tarball (link in the description)
  2. source your ORFS env.sh
  3. execute run-me-BoomFrontend-asap7-base.sh

Relevant log output

No response

Screenshots

No response

Additional Context

No response

eder-matheus added the mpl (Macro Placement) label Nov 4, 2024
@AcKoucher
Contributor

@jeffng-or Apparently it's not mpl2 itself that is blowing up. During clustering, we call par (TritonPart) to partition big flat clusters, i.e., big clusters made of only leaf macros/std cells. Based on your log, the segfault is happening inside par.

AcKoucher added the par (Partitioning) label Nov 4, 2024
@jeffng-or
Contributor Author

@jeffng-or Apparently it's not mpl2 itself that is blowing up. During clustering, we call par (TritonPart) to partition big flat clusters i.e., big clusters made of only leaf macros/std cells. Based on your log, the segfault is happening inside par.

Sure, makes sense. The key point is that breakLargeFlatCluster recurses down 11100 frames (I think I cut the stack trace file off one level too soon, so my bad on that). Recursing effectively without bound will eventually cause a failure somewhere, and it happens to be in par.

@maliberty
Member

@AcKoucher the end of the stack is in par but most of the stack is in mpl2. I think the problem is the recursion in breakLargeFlatCluster. How many parts are we trying to break this cluster down into? I suspect something is off in the cluster size.

@AcKoucher
Contributor

@maliberty I see. I'll investigate.

@maliberty
Member

If that much splitting is necessary then you can write it non-recursively.
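
A minimal sketch of what a non-recursive version could look like, assuming a simplified stand-in for the real code: `FlatCluster`, `SplitFn` and `breakIteratively` are hypothetical names, not the mpl2 or par API, and the split callback only stands in for the call into TritonPart. An explicit worklist replaces the call stack, and a split that leaves one side empty is treated as "no progress" so the loop cannot spin forever on the same cluster.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a flat cluster: only its leaf count matters here.
struct FlatCluster
{
  std::size_t num_leaves;
};

// Hypothetical bipartition callback: returns the leaf counts of the two parts.
using SplitFn
    = std::function<std::pair<std::size_t, std::size_t>(const FlatCluster&)>;

// Break a cluster into pieces of at most max_leaves leaves using an explicit
// worklist, so the depth of splitting is limited by heap memory rather than
// by the call stack.
std::vector<FlatCluster> breakIteratively(const FlatCluster& root,
                                          std::size_t max_leaves,
                                          const SplitFn& split)
{
  std::vector<FlatCluster> done;
  std::vector<FlatCluster> work{root};
  while (!work.empty()) {
    const FlatCluster cluster = work.back();
    work.pop_back();
    if (cluster.num_leaves <= max_leaves) {
      done.push_back(cluster);  // small enough, nothing to do
      continue;
    }
    const std::pair<std::size_t, std::size_t> parts = split(cluster);
    // The callback is expected to distribute every leaf to one of the parts.
    assert(parts.first + parts.second == cluster.num_leaves);
    if (parts.first == 0 || parts.second == 0) {
      // No progress: one side is empty, so splitting again would only
      // reproduce the same cluster. Keep it as is instead of looping.
      done.push_back(cluster);
      continue;
    }
    work.push_back(FlatCluster{parts.first});
    work.push_back(FlatCluster{parts.second});
  }
  return done;
}
```

Note that removing the recursion alone only turns a stack overflow into an endless loop when the partitioner keeps returning an empty part, which is why the sketch also checks for progress.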

@AcKoucher
Contributor

Apparently TritonPart is doing a terrible job when trying to partition (ftq)_glue_logic:

[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0: Num Macros: 1 Num Std Cells: 31578
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_1: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0: Num Macros: 1 Num Std Cells: 31578
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_1: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0: Num Macros: 1 Num Std Cells: 31578
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_1: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0: Num Macros: 1 Num Std Cells: 31578
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_1: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0_0_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0: Num Macros: 1 Num Std Cells: 31198
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_1: Num Macros: 0 Num Std Cells: 380
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0_0_0_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0: Num Macros: 1 Num Std Cells: 31198
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_1: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0_0_0_0_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0_0: Num Macros: 1 Num Std Cells: 31198
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0_1: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0_0_0_0_0_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0_0_0: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0_0_1: Num Macros: 1 Num Std Cells: 31198
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0_0_0_0_0_0_1 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0_0_1_0: Num Macros: 1 Num Std Cells: 31198
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0_0_1_1: Num Macros: 0 Num Std Cells: 0
[sinks into oblivion ...]

@AcKoucher
Contributor

@maliberty I'm not sure how to proceed here. Should mpl2 reject the result and take care of splitting the cluster if the partitions generated by TritonPart are not good?
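
To make the question concrete, here is a rough sketch of what "reject the result and split it ourselves" might look like on the caller side. Everything in it is an assumption for illustration: `isDegenerateBipartition`, `evenSplitFallback` and the `min_fraction` threshold are hypothetical names, and the input is assumed to be a plain 0/1 part id per leaf rather than whatever par actually returns.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when a two-way assignment leaves the smaller side with fewer
// than min_fraction of all leaves. Entries equal to 1 count as part 1; every
// other value is counted as part 0 (the sketch assumes only 0 and 1 occur).
bool isDegenerateBipartition(const std::vector<int>& part_of_leaf,
                             double min_fraction)
{
  assert(min_fraction >= 0.0 && min_fraction <= 0.5);
  if (part_of_leaf.empty()) {
    return true;
  }
  const std::size_t ones = static_cast<std::size_t>(
      std::count(part_of_leaf.begin(), part_of_leaf.end(), 1));
  const std::size_t zeros = part_of_leaf.size() - ones;
  const std::size_t smaller = std::min(ones, zeros);
  return static_cast<double>(smaller)
         < min_fraction * static_cast<double>(part_of_leaf.size());
}

// Naive fallback: the first half of the leaves goes to part 0, the rest to
// part 1. It ignores connectivity entirely and only guarantees progress.
std::vector<int> evenSplitFallback(std::size_t num_leaves)
{
  std::vector<int> part_of_leaf(num_leaves, 1);
  std::fill_n(part_of_leaf.begin(), num_leaves / 2, 0);
  return part_of_leaf;
}
```

A caller could check the partitioner's output with `isDegenerateBipartition` and swap in `evenSplitFallback` only when the check fails. The fallback gives up cut quality for guaranteed progress, so it would be a stopgap at best and not a substitute for a fix in the partitioner itself.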

@maliberty
Member

I think TP (TritonPart) should be fixed.


4 participants