Question(FQDN): how to connect to Pegasus cluster with FQDN pegasus version #2007

Closed
ninsmiracle opened this issue May 10, 2024 · 7 comments

Comments

@ninsmiracle
Contributor

ninsmiracle commented May 10, 2024

General Question

When I deploy the master branch of Pegasus to a real cluster, I cannot connect to Pegasus via pegasus_shell.

  1. First, I changed all the IPs to hostnames in the Pegasus config.
  2. Then I deployed it to the machines.
  3. I connected to the Pegasus cluster via admin-cli, for example with the command ./admin-cli -n aaa:25101,bbb:25101, but it returned fatal: failed to list nodes [context deadline exceeded].
  4. I connected to the Pegasus cluster via pegasus-shell. It works. However, when I type nodes -d, the cluster crashes.
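
Before pointing a tool at an FQDN-based cluster, it can help to confirm that every hostname in the meta list resolves from the machine running the tool. Below is a minimal Python sketch, assuming the list has the same host:port,host:port shape as the -n argument above. The names parse_meta_list and unresolvable_hosts are hypothetical helpers, not part of Pegasus or admin-cli, and the resolver is injectable so the check can be exercised without DNS.

```python
import socket
from typing import Callable, List, Tuple


def parse_meta_list(meta_list: str) -> List[Tuple[str, int]]:
    """Split 'host:port,host:port' into (host, port) pairs."""
    pairs = []
    for item in meta_list.split(","):
        host, _, port = item.strip().rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"malformed endpoint: {item!r}")
        pairs.append((host, int(port)))
    return pairs


def unresolvable_hosts(
    endpoints: List[Tuple[str, int]],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> List[str]:
    """Return the hosts whose forward (name -> IP) lookup fails."""
    failed = []
    for host, _port in endpoints:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

Calling unresolvable_hosts(parse_meta_list("aaa:25101,bbb:25101")) lists the hostnames the local resolver cannot look up. An empty result only rules out a client-side name-resolution problem; it does not by itself explain the deadline error.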

stdout(error log) in meta server:

I2024-05-08 14:13:57.603 (1715148837603905326 81668) : pegasus server starting, pid(81668), version($Version: Pegasus Server 2.6.0-SNAPSHOT (aea1cfe632d455fcddfe4c92ebbd9d4e89037abb) Release, built by gcc 7.3.1, built on 12180ab51819, built at May  7 2024 12:14:31 $)
F2024-05-08 14:15:26.215 (1715148926215608204 81749)   meta.THREAD_POOL_META_SERVER3.02003f3d00010001: rpc_host_port.cpp:62:from_address(): assertion expression: [utils::hostname_from_ip(__bswap_32 (addr.ip()), &hp._host)] invalid host_port 172.17.0.1

172.17.0.1 is the IP of my pegasus-shell, which runs in a Docker container. It looks like Pegasus cannot resolve this IP correctly. Is it a bug?
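
The assertion above fails when hostname_from_ip cannot turn the peer IP into a hostname. Whether a given address has a reverse mapping can be checked from the server host with a short Python sketch. The helper reverse_lookup is hypothetical, and its injectable lookup parameter exists only so the function can be tried without touching DNS. It is a diagnostic aid, not the code path Pegasus uses.

```python
import socket
from typing import Callable, Optional


def _getnameinfo(ip: str) -> str:
    # NI_NAMEREQD makes the call fail instead of echoing the numeric
    # address back when no name is found.
    host, _service = socket.getnameinfo((ip, 0), socket.NI_NAMEREQD)
    return host


def reverse_lookup(
    ip: str, lookup: Callable[[str], str] = _getnameinfo
) -> Optional[str]:
    """Return the hostname for ip, or None if it has no reverse mapping."""
    try:
        return lookup(ip)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
```

Running reverse_lookup("172.17.0.1") on the meta server host shows whether the shell's address can be reverse resolved there; a None result matches the condition reported in the log.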

I also found these core dumps on the replica servers.

Program terminated with signal SIGABRT, Aborted.
#0  0x00007ffaedff01d7 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffaedff01d7 in raise () from /lib64/libc.so.6
#1  0x00007ffaedff18c8 in abort () from /lib64/libc.so.6
#2  0x00007ffaf240ca1e in dsn_coredump () at /home/guoningshen/code/incubator-pegasus/src/runtime/service_api_c.cpp:130
#3  0x00007ffaef3e8134 in process_fatal_log (log_level=<optimized out>) at /home/guoningshen/code/incubator-pegasus/src/utils/simple_logger.cpp:117
#4  dsn::tools::simple_logger::log (this=0x1a38200, file=<optimized out>, function=<optimized out>, line=<optimized out>, log_level=<optimized out>, str=<optimized out>)
    at /home/guoningshen/code/incubator-pegasus/src/utils/simple_logger.cpp:284
#5  0x00007ffaf21ec19b in dsn::replication::replica_stub::open_replica (this=0x1851800, app=..., id=..., group_check=..., configuration_update=...)
    at /home/guoningshen/code/incubator-pegasus/src/replica/replica_stub.cpp:1817
#6  0x00007ffaf2447be1 in dsn::task::exec_internal (this=0x1f50b40) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task.cpp:173
#7  0x00007ffaf245f257 in dsn::task_worker::loop (this=0x1717290) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task_worker.cpp:245
#8  0x00007ffaf245fdc0 in dsn::task_worker::run_internal (this=0x1717290) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task_worker.cpp:225
#9  0x00007ffaf0ed9a3f in execute_native_thread_routine () from /home/work/app/pegasus/c3tst-performance1/replica/package/bin/librocksdb.so.8
#10 0x00007ffaef66edc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffaee0b273d in clone () from /lib64/libc.so.6
(gdb)
@ninsmiracle
Contributor Author

ninsmiracle commented May 10, 2024

So I want to know what I should do to deploy a Pegasus cluster with FQDN now, and how to use the tools to control this cluster. Thanks a lot. @acelyc111

@ninsmiracle ninsmiracle changed the title Question(FQDN): how to connect to Pegasus cluster with FQDN Question(FQDN): how to connect to Pegasus cluster with FQDN pegasus version May 10, 2024
@ninsmiracle
Contributor Author

Let me add more details:

  1. Deploy the cluster. It works, and every node is running.

  2. Use pegasus-shell to connect to the cluster.
    image

  3. Send any RPC command, such as nodes -dr or ls -d. The result is TIME_OUT.
    image

  4. There are a lot of core dumps on the meta server.
    image

Core like core.meta.THREAD_PO...

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/work/app/pegasus/c3tst-performance1/meta/package/bin/pegasus_server confi'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f3c0c8bc1d7 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f3c0c8bc1d7 in raise () from /lib64/libc.so.6
#1  0x00007f3c0c8bd8c8 in abort () from /lib64/libc.so.6
#2  0x00007f3c10cd8a1e in dsn_coredump () at /home/guoningshen/code/incubator-pegasus/src/runtime/service_api_c.cpp:130
#3  0x00007f3c0dcb4134 in process_fatal_log (log_level=<optimized out>) at /home/guoningshen/code/incubator-pegasus/src/utils/simple_logger.cpp:117
#4  dsn::tools::simple_logger::log (this=0x2e3a200, file=<optimized out>, function=<optimized out>, line=<optimized out>, log_level=<optimized out>, str=<optimized out>)
    at /home/guoningshen/code/incubator-pegasus/src/utils/simple_logger.cpp:284
#5  0x00007f3c10d09ff3 in dsn::host_port::from_address (addr=...) at /home/guoningshen/code/incubator-pegasus/src/runtime/rpc/rpc_host_port.cpp:60
#6  0x00007f3c10d0f0c5 in dsn::message_ex::create_response (this=this@entry=0x327be00) at /home/guoningshen/code/incubator-pegasus/src/runtime/rpc/rpc_message.cpp:358
#7  0x00007f3c10d0638d in dsn::rpc_engine::forward (this=this@entry=0x2c4f180, request=request@entry=0x327be00, address=...) at /home/guoningshen/code/incubator-pegasus/src/runtime/rpc/rpc_engine.cpp:853
#8  0x00007f3c10cd90a3 in dsn_rpc_forward (request=0x327be00, addr=...) at /home/guoningshen/code/incubator-pegasus/src/runtime/service_api_c.cpp:207
#9  0x00007f3c0ffc6196 in forward (addr=..., this=0x7f3bee4e5f20) at /home/guoningshen/code/incubator-pegasus/src/runtime/rpc/rpc_holder.h:224
#10 dsn::replication::meta_service::check_leader<dsn::rpc_holder<dsn::replication::configuration_list_apps_request, dsn::replication::configuration_list_apps_response> > (this=this@entry=0x32ee000, 
    rpc=..., forward_address=<optimized out>) at /home/guoningshen/code/incubator-pegasus/src/meta/meta_service.h:406
#11 0x00007f3c0ffc629a in dsn::replication::meta_service::check_leader_status<dsn::rpc_holder<dsn::replication::configuration_list_apps_request, dsn::replication::configuration_list_apps_response> > (
    this=this@entry=0x32ee000, rpc=..., forward_address=forward_address@entry=0x0) at /home/guoningshen/code/incubator-pegasus/src/meta/meta_service.h:420
#12 0x00007f3c0ff9ef6a in dsn::replication::meta_service::on_list_apps (this=0x32ee000, rpc=...) at /home/guoningshen/code/incubator-pegasus/src/meta/meta_service.cpp:671
#13 0x00007f3c0fff8653 in operator() (request=<optimized out>, __closure=<optimized out>) at /home/guoningshen/code/incubator-pegasus/src/runtime/serverlet.h:201
#14 std::_Function_handler<void (dsn::message_ex*), bool dsn::serverlet<dsn::replication::meta_service>::register_rpc_handler_with_rpc_holder<dsn::rpc_holder<dsn::replication::configuration_list_apps_request, dsn::replication::configuration_list_apps_response> >(dsn::task_code, char const*, void (dsn::replication::meta_service::*)(dsn::rpc_holder<dsn::replication::configuration_list_apps_request, dsn::replication::configuration_list_apps_response>))::{lambda(dsn::message_ex*)#1}>::_M_invoke(std::_Any_data const&, dsn::message_ex*&&) (__functor=..., __args#0=<optimized out>)
    at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:316
#15 0x00007f3c10d123b2 in operator() (__args#0=<optimized out>, this=0x2b310d0) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:706
#16 dsn::rpc_request_task::exec (this=0x2b31000) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task.h:436
#17 0x00007f3c10d13be1 in dsn::task::exec_internal (this=0x2b31000) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task.cpp:173
#18 0x00007f3c10d2b257 in dsn::task_worker::loop (this=0x2b19290) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task_worker.cpp:245
#19 0x00007f3c10d2bdc0 in dsn::task_worker::run_internal (this=0x2b19290) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task_worker.cpp:225
#20 0x00007f3c0f7a5a3f in execute_native_thread_routine () from /home/work/app/pegasus/c3tst-performance1/meta/package/bin/librocksdb.so.8
#21 0x00007f3c0df3adc5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f3c0c97e73d in clone () from /lib64/libc.so.6
(gdb) 

Core like core.pegasus_server....

#0  0x0000000000000000 in ?? ()
#1  0x00007f693f83b6c0 in (anonymous namespace)::stacktrace_generic_fp::capture<false, false> (result=result@entry=0xaee010, max_depth=31, skip_count=1, initial_frame=initial_frame@entry=0x7ffd328eae80, 
    initial_pc=initial_pc@entry=0x0, sizes=0x0) at src/stacktrace_generic_fp-inl.h:175
#2  0x00007f693f83b74a in GetStackTrace_generic_fp (result=0xaee010, max_depth=<optimized out>, skip_count=<optimized out>) at src/stacktrace_generic_fp-inl.h:332
#3  0x00007f693f83ba52 in GetStackTrace (result=result@entry=0xaee010, max_depth=max_depth@entry=30, skip_count=skip_count@entry=0) at src/stacktrace.cc:346
#4  0x00007f693f82c37e in tcmalloc::PageHeap::HandleUnlock (this=0x7f693fa56720 <tcmalloc::Static::pageheap_>, context=0x7ffd328eaf10) at src/page_heap.cc:155
#5  0x00007f693f82e07a in ~LockingContext (this=0x7ffd328eaf10, __in_chrg=<optimized out>) at src/page_heap.cc:77
#6  tcmalloc::PageHeap::NewWithSizeClass (this=this@entry=0x7f693fa56720 <tcmalloc::Static::pageheap_>, n=n@entry=1, sizeclass=26) at src/page_heap.cc:161
#7  0x00007f693f82beb7 in tcmalloc::CentralFreeList::Populate (this=this@entry=0x7f693fbe1420 <tcmalloc::Static::central_cache_+31616>) at src/central_freelist.cc:314
#8  0x00007f693f82c088 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe (this=0x7f693fbe1420 <tcmalloc::Static::central_cache_+31616>, N=1, start=0x7ffd328eb020, end=0x7ffd328eb028)
    at src/central_freelist.cc:273
#9  0x00007f693f82c120 in tcmalloc::CentralFreeList::RemoveRange (this=0x7f693fbe1420 <tcmalloc::Static::central_cache_+31616>, start=start@entry=0x7ffd328eb020, end=end@entry=0x7ffd328eb028, N=1)
    at src/central_freelist.cc:253
#10 0x00007f693f82fca3 in tcmalloc::ThreadCache::FetchFromCentralCache (this=this@entry=0xb0e000, cl=cl@entry=26, byte_size=byte_size@entry=576, 
    oom_handler=oom_handler@entry=0x7f693f81d240 <(anonymous namespace)::nop_oom_handler(size_t)>) at src/thread_cache.cc:125
#11 0x00007f693f83f15d in Allocate (oom_handler=0x7f693f81d240 <(anonymous namespace)::nop_oom_handler(size_t)>, cl=26, size=576, this=<optimized out>) at src/thread_cache.h:381
#12 do_malloc (size=568) at src/tcmalloc.cc:1414
#13 do_allocate_full<tcmalloc::malloc_oom> (size=568) at src/tcmalloc.cc:1804
#14 tcmalloc::allocate_full_malloc_oom (size=568) at src/tcmalloc.cc:1820
#15 0x00007f693dfa754d in __fopen_internal () from /lib64/libc.so.6
#16 0x00007f693ca60a16 in selinuxfs_exists () from /lib64/libselinux.so.1
#17 0x00007f693ca58ce8 in init_lib () from /lib64/libselinux.so.1
#18 0x00007f6943dfd1e3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#19 0x00007f6943def21a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#20 0x0000000000000004 in ?? ()
#21 0x00007ffd328ed220 in ?? ()
#22 0x00007ffd328ed26a in ?? ()
#23 0x00007ffd328ed275 in ?? ()
#24 0x00007ffd328ed27f in ?? ()
#25 0x0000000000000000 in ?? ()
(gdb) 
6. stdout (error log) on the meta server:
W2024-05-11 10:33:36.503 (1715394816503732375 36348) : overwrite default thread pool for task RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX from THREAD_POOL_META_SERVER to THREAD_POOL_DEFAULT
W2024-05-11 10:33:36.503 (1715394816503775340 36348) : overwrite default thread pool for task RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX_ACK from THREAD_POOL_META_SERVER to THREAD_POOL_DEFAULT
I2024-05-11 10:33:36.503 (1715394816503863057 36348) : pegasus server starting, pid(36348), version($Version: Pegasus Server 2.6.0-SNAPSHOT (aea1cfe632d455fcddfe4c92ebbd9d4e89037abb) Release, built by gcc 7.3.1, built on 12180ab51819, built at May  7 2024 12:14:31 $)
F2024-05-11 10:36:03.558 (1715394963558260142 36428)   meta.THREAD_POOL_META_SERVER2.02008e370001000c: rpc_host_port.cpp:62:from_address(): assertion expression: [utils::hostname_from_ip(__bswap_32 (addr.ip()), &hp._host)] invalid host_port 172.17.0.1

7. By the way, all the replica servers were running during that time.
image

8. Also, I cannot connect to the cluster via admin-cli.
image

@acelyc111
Member

acelyc111 commented May 11, 2024

Hi, @ninsmiracle !

Is the Pegasus cluster deployed as a onebox in the docker container? Do the Pegasus shell tool and admin-cli run in the same docker container?

@ninsmiracle
Contributor Author

ninsmiracle commented May 13, 2024

> Hi, @ninsmiracle !
>
> Is the Pegasus cluster deployed as a onebox in the docker container? Do the Pegasus shell tool and admin-cli run in the same docker container?

When I deployed it as a onebox in my Docker container, the cluster ran normally. However, when I deploy it on real nodes, the cluster is running but cannot accept any RPC.
I think the key point is meta.THREAD_POOL_META_SERVER2.02008e370001000c: rpc_host_port.cpp:62:from_address(): assertion expression: [utils::hostname_from_ip(__bswap_32 (addr.ip()), &hp._host)] invalid host_port 172.17.0.1.

@acelyc111
Member

> I connected to the Pegasus cluster via admin-cli, for example with the command ./admin-cli -n aaa:25101,bbb:25101, but it returned fatal: failed to list nodes [context deadline exceeded].

This is because the main FQDN patch, once merged, introduced a new Thrift structure (i.e. host_port) that the admin-cli side doesn't know. You can check this in admin-cli's shell.log, where the error looks like:

time="2024-05-23T00:30:55+08:00" level=info msg="failed to read response from [127.0.0.1:34601(meta)]: *admin.ListNodesResponse error reading struct: *admin.NodeInfo error reading struct: Unknown data type 57"

The fix is to update the go-client that admin-cli depends on. However, we have to resolve #1917 first.
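
To confirm this diagnosis on a given deployment, admin-cli's shell.log can be scanned for the decode error. Below is a minimal Python sketch, assuming the log lines have the same shape as the sample above. The helper find_unknown_type_errors is hypothetical, and its pattern is derived from that single sample line only.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines shaped like the sample log line; the pattern is an
# assumption based on that one example.
_PATTERN = re.compile(
    r"failed to read response from \[(?P<endpoint>[^\]]+)\]:.*"
    r"Unknown data type (?P<type_id>\d+)"
)


def find_unknown_type_errors(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (endpoint, type_id) for every matching decode failure."""
    hits = []
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            hits.append((match.group("endpoint"), int(match.group("type_id"))))
    return hits
```

Passing an open file object for shell.log (its location depends on where admin-cli was started) yields one entry per failed response, which makes it easy to see which meta server returned the structure the client could not read.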

@acelyc111
Member

> > Hi, @ninsmiracle !
> > Is the Pegasus cluster deployed as a onebox in the docker container? Do the Pegasus shell tool and admin-cli run in the same docker container?
>
> When I deployed it as a onebox in my Docker container, the cluster ran normally. However, when I deploy it on real nodes, the cluster is running but cannot accept any RPC. I think the key point is meta.THREAD_POOL_META_SERVER2.02008e370001000c: rpc_host_port.cpp:62:from_address(): assertion expression: [utils::hostname_from_ip(__bswap_32 (addr.ip()), &hp._host)] invalid host_port 172.17.0.1.

@ninsmiracle You can check if this patch could solve the issue: #2044

acelyc111 added a commit that referenced this issue Jun 21, 2024
#2007

In servers, we assume that remote IPs may not be reverse resolvable; in
this case, warning or error messages are logged instead of crashing.
But in tests, we assume that all IPs can be reverse resolved.
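
The policy described in this commit message can be sketched in Python as follows. This illustrates the behavior only and is not the C++ change in #2044: to_host_port, its strict flag, and the injectable lookup are hypothetical names chosen for the example.

```python
import logging
import socket
from typing import Callable, Optional, Tuple

logger = logging.getLogger("host_port_sketch")


def _getnameinfo(ip: str) -> str:
    # NI_NAMEREQD: fail rather than return the numeric address unchanged.
    host, _service = socket.getnameinfo((ip, 0), socket.NI_NAMEREQD)
    return host


def to_host_port(
    ip: str,
    port: int,
    strict: bool = False,
    lookup: Callable[[str], str] = _getnameinfo,
) -> Optional[Tuple[str, int]]:
    """Convert (ip, port) to (hostname, port).

    strict=False (server-like): log a warning and return None when the
    reverse lookup fails. strict=True (test-like): let the failure propagate.
    """
    try:
        return (lookup(ip), port)
    except OSError as err:
        if strict:
            raise
        logger.warning("cannot reverse resolve %s: %s", ip, err)
        return None
```

With strict left at its default, an unresolvable peer such as 172.17.0.1 produces a log line and a None result instead of aborting the process; callers then have to handle the missing hostname explicitly.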
ruojieranyishen pushed a commit to ruojieranyishen/incubator-pegasus that referenced this issue Jul 17, 2024
…che#2044)

@acelyc111
Member

@ninsmiracle If it has been resolved, I'll close the issue.
