Troubleshooting: Provide concise troubleshooting guidelines #55

Merged
merged 4 commits into from
Jul 9, 2024
96 changes: 96 additions & 0 deletions docs/admin/troubleshooting/cfr.md
@@ -0,0 +1,96 @@
(cfr)=
# CrateDB Flight Recorder (CFR)

:::{rubric} About
:::
In a similar spirit to the [](#jfr), CFR helps to collect information about
CrateDB clusters for support requests and self-service debugging.

CFR is a utility application to acquire and export diagnostic information from
CrateDB's [system tables](#systables) into an archive file. You can transmit
this file to support engineers, in order to optimally convey relevant
information about your cluster, mostly for debugging and troubleshooting
purposes.

:::{rubric} Details
:::
The CrateDB Flight Recorder (CFR) is an ETL application that dumps all database
tables in the `sys` schema into a timestamped tarball archive file.
On the receiving end, the recording can be imported into another CrateDB
instance for inspection and analysis.

Flight recordings can be started against any running CrateDB cluster at runtime.
The utility connects to CrateDB like a regular client, talking SQL.
CFR is part of the CrateDB Toolkit (`ctk cfr`), and is also available as a
standalone application `cratedb-cfr(.exe)`.


## Synopsis

:Export:

`cratedb-cfr sys-export` invokes the export operation.

:Import:

`cratedb-cfr sys-import` invokes the import operation.


## Install

Select one of the standalone application bundles, matching the platform
and architecture of the corresponding system where you intend to run CFR.

::::{grid} 1 2 2 2

:::{grid-item-card} {material-outlined}`download_for_offline;1.4em` Linux x64
:link: https://github.com/crate-workbench/cratedb-toolkit/actions/runs/9826830191/artifacts/1674929097

What is the approach to keep these links to release bundles up-to-date? I noticed it currently links to a specific GitHub Action run. Is there a possibility to have the artifacts as part of the regular release assets (https://github.com/crate-workbench/cratedb-toolkit/releases), so updating it here is just a matter of keeping it in sync with the latest cratedb-toolkit version number?

@amotl amotl Jul 8, 2024

Thanks, this is right on the spot. Currently, there is no solid approach yet, we just did the minimum things to establish a build matrix through a corresponding GitHub Actions workflow.

I've provided a relevant response citing you on the corresponding tracking ticket, so it will not get lost when merging this patch.

We will need to improve the situation with subsequent iterations.

:link-alt: CFR for Linux x64
:padding: 0
:class-title: sd-fs-5
+++
cratedb-cfr-linux-x64.zip
:::

:::{grid-item-card} {material-outlined}`download_for_offline;1.4em` macOS x64
:link: https://github.com/crate-workbench/cratedb-toolkit/actions/runs/9826830191/artifacts/1674929134
:link-alt: CFR for macOS x64
:padding: 0
:class-title: sd-fs-5
+++
cratedb-cfr-macos-x64.zip
:::

:::{grid-item-card} {material-outlined}`download_for_offline;1.4em` Windows x64
:link: https://github.com/crate-workbench/cratedb-toolkit/actions/runs/9826830191/artifacts/1674930132
:link-alt: CFR for Windows x64
:padding: 0
:class-title: sd-fs-5
+++
cratedb-cfr-windows-x64.zip
:::

:::{grid-item-card} {material-outlined}`download_for_offline;1.4em` macOS ARM64
:link: https://github.com/crate-workbench/cratedb-toolkit/actions/runs/9826830191/artifacts/1674927962
:link-alt: CFR for macOS ARM64
:padding: 0
:class-title: sd-fs-5
+++
cratedb-cfr-macos-arm64.zip
:::

::::



## Learn

:::{card} {material-outlined}`library_books;1.6em` CrateDB Cluster Flight Recorder (CFR)
:link: ctk:cfr
:link-type: ref
Learn about the concepts of CFR, and how to use it.
:::


[Java Flight Recorder]: https://en.wikipedia.org/wiki/JDK_Flight_Recorder
[jcmd]: https://docs.oracle.com/en/java/javase/17/docs/specs/man/jcmd.html
104 changes: 51 additions & 53 deletions docs/admin/troubleshooting/crate-node.rst
@@ -2,17 +2,17 @@

.. _use-crate-node:

===============================================
Troubleshooting with the ``crate-node`` command
===============================================
==========================
The ``crate-node`` command
==========================

This document shows you how to troubleshoot CrateDB nodes with the
`crate-node`_ command. Using this command, you can:
Use the `crate-node`_ command to troubleshoot CrateDB cluster nodes.
Using this command, you can:

* Repurpose nodes and clean up their old data
* Repurpose nodes and clean up their old data.
* Force the election of a master node (and the creation of a new cluster) in
the event that you lose too many nodes to be able to form a quorum
* Detach nodes from an old cluster so they can be moved to a new cluster
the event that you lose too many nodes to be able to form a quorum.
* Detach nodes from an old cluster so they can be moved to a new cluster.

.. rubric:: Table of contents

@@ -28,38 +28,35 @@ This document shows you how to troubleshoot CrateDB nodes with the
Repurpose a node
================

.. rubric:: About

In a situation where you have irrecoverably lost the majority of the
master-eligible nodes in a cluster, you may need to form a new cluster.

When forming a new cluster, you may have to change the `role`_ of one or more
nodes. Changing the role of a node is referred to as *repurposing* a node.

Each node checks the contents of its :ref:`data path <crate-reference:conf-env>`
at startup. If CrateDB
discovers unexpected data, it will refuse to start. Specifically:
at startup. If CrateDB discovers unexpected data, it will refuse to start.
The specific rules are:

- Nodes configured with `node.data`_ set to ``false`` will refuse to start if
they find any shard data at startup
they find any shard data at startup.

- Nodes configured with both `node.master`_ set to ``false`` and `node.data`_
set to ``false`` will refuse to start if they have any index metadata at
startup
startup.
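
The two startup rules above can be sketched as a small shell function. This is
only an illustration of the decision logic as described here, not CrateDB's
actual implementation. The function name ``can_start`` and its four
``true``/``false`` arguments are hypothetical.

```shell
# Sketch of the startup rules described above (not CrateDB's real code).
# Arguments, each "true" or "false":
#   $1 node.data setting, $2 node.master setting,
#   $3 shard data found on disk, $4 index metadata found on disk
# Returns 0 if the node would start, 1 if it would refuse to start.
can_start() {
  node_data=$1
  node_master=$2
  has_shard_data=$3
  has_index_metadata=$4

  # Rule 1: a non-data node must not find any shard data.
  if [ "$node_data" = false ] && [ "$has_shard_data" = true ]; then
    return 1
  fi

  # Rule 2: a node that is neither data nor master-eligible
  # must not find any index metadata.
  if [ "$node_data" = false ] && [ "$node_master" = false ] \
     && [ "$has_index_metadata" = true ]; then
    return 1
  fi

  return 0
}
```

For example, ``can_start false true true false`` returns a non-zero status,
which corresponds to a master-only node that still holds shard data and needs
to be cleaned up before it can start.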

The `crate-node`_ :ref:`repurpose command <crate-reference:cli-crate-node-commands>`
can help you clean up the necessary
node data so that CrateDB can be restarted with a new role.
can help you clean up the necessary node data, so that CrateDB can be restarted
with a new role.


Procedure
---------
.. rubric:: Procedure

To repurpose a node, you must first stop it.

Then, update the settings `node.data`_ and `node.master`_ in the ``crate.yml``
:ref:`configuration file <crate-reference:config>` as needed.

The ``node.data`` and ``node.master`` settings can be configured in four
different ways, each corresponding to a different type of node:
different ways, each corresponding to a different type of node.

+-------------------+------------------------+-----------------------------+
| Role | Configuration | After repurposing |
@@ -95,7 +92,7 @@ deleted (i.e., "cleaned up") after repurposing the node to that configuration.
Before running the ``repurpose`` command, make sure that any data you want
to keep is available on other nodes in the cluster.

Then, run the ``repurpose`` command:
Then, invoke the ``repurpose`` command.

.. code-block:: console

@@ -112,33 +109,36 @@ Then, run the ``repurpose`` command:
Node successfully repurposed to master and no data.

As mentioned in the command output, you can pass in ``-v`` to get a more
verbose output, like so:
verbose output.

.. code-block:: console

sh$ ./bin/crate-node repurpose -v

Finally, start the node again.

The node has been successfully repurposed.
Finally, start the node again. After that, the node has been successfully
repurposed.


.. _crate-node-unsafe-bootstrap:

Perform an unsafe cluster bootstrap
===================================

.. rubric:: About

When communication is lost between one or more nodes in a cluster (e.g., during
a *cluster partition*), the situation is assumed to be temporary and safeguards
a `network partition`_), the situation is assumed to be temporary and safeguards
exist to prevent the election of a master node unless a `quorum`_ can be
established.

However, if the situation is permanent (i.e., you have irrecoverably lost a
majority of the nodes in your cluster), you will need to force the election of
majority of the nodes in your cluster), also known as a `split-brain`_ situation,
you will need to force the election of
a master. Forcing a master election without quorum is referred to as an *unsafe
cluster bootstrap*.

The `crate-node`_ ``unsafe-bootstrap`` command can help you choose a new master
The :ref:`unsafe-bootstrap command <crate-reference:cli-crate-node-commands>`
can help you choose a new master
node and subsequently perform an unsafe cluster bootstrap.

.. WARNING::
@@ -160,8 +160,7 @@ node and subsequently perform an unsafe cluster bootstrap.
have access to the file system.


Procedure
---------
.. rubric:: Procedure

Before you continue, you must stop all master-eligible nodes in the cluster.

@@ -175,12 +174,11 @@ Before you continue, you must stop all master-eligible nodes in the cluster.
Once all master-eligible nodes in the cluster have been stopped, you can
manually select a new master.

To help you select a new master, the ``unsafe-bootstrap`` command returns
information about the node cluster state as a pair of values in the form
*(term, version)*.

To help you select a new master node, the ``unsafe-bootstrap`` command
returns information about the node cluster state as a pair of values in the
form *(term, version)*.
You can gather this information (safely) by issuing the ``unsafe-bootstrap``
command and answering "no" (``n``) at the confirmation prompt, like so:
command and answering "no" (``n``) at the confirmation prompt.

.. code-block:: console

@@ -211,8 +209,8 @@ value, select any one of them.
that you elect a master node with the freshest state data. This, in turn,
minimizes the potential for data loss and inconsistency.
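
As a sketch of this selection, standard ``sort`` can rank the candidates once
you have noted each node's pair by hand. Two assumptions apply: the input
format ``<node> <term> <version>`` is hypothetical (it is not ``crate-node``
output), and the ranking rule used is "highest term wins, ties broken by the
highest version". The function name ``pick_master_candidate`` is also made up
for illustration.

```shell
# Reads lines of the form "<node> <term> <version>" on stdin and prints
# the name of the node with the highest (term, version) pair.
# Assumption: higher term wins; equal terms are decided by higher version.
pick_master_candidate() {
  # Sort numerically descending by term (field 2), then by version (field 3),
  # and keep only the node name of the first line.
  sort -k2,2nr -k3,3nr | head -n 1 | cut -d' ' -f1
}

# Example with three hypothetical nodes.
printf '%s\n' 'node-a 4 4' 'node-b 4 7' 'node-c 3 9' | pick_master_candidate
# prints: node-b
```

If several nodes share the same pair, the function still prints only one of
them, which matches the advice to select any one of them.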

Once you have selected a node to elect to master, run the ``unsafe-bootstrap``
command on that node and answer yes (``y``) at the confirmation prompt:
Once you have selected a node to elect to master, invoke the ``unsafe-bootstrap``
command on that node and answer yes (``y``) at the confirmation prompt.

.. code-block:: console

@@ -226,46 +224,45 @@ command on that node and answer yes (``y``) at the confirmation prompt:

Confirm [y/N] y

If the operation was successful, the command will output:
If the operation was successful, the program will acknowledge it.
**Note:** This success message indicates that the operation was completed.
You may still experience data loss and inconsistencies.

.. code-block:: console

Master node was successfully bootstrapped

.. NOTE::

This success message indicates that the operation was completed. You may
still experience data loss and inconsistencies.

Start the bootstrapped node and verify that it has started a new cluster with
Now, start the bootstrapped node and verify that it has started a new cluster with
one node and elected itself as the master.

Before you can add the rest of the nodes to the new cluster, you must detach
them from the old cluster (see the :ref:`next section
<crate-node-detach-cluster>`).

When that's done, start the nodes and verify that they join the new cluster.
After that's done, start the nodes and verify that they join the new cluster.

.. NOTE::

Once the new cluster is up-and-running and all recoveries are complete, you
are responsible for assessing the cluster for data loss and
inconsistencies.
are advised to assess the database for data loss and inconsistencies.


.. _crate-node-detach-cluster:

Detach a node from its cluster
==============================

.. rubric:: About

To protect nodes from inadvertently rejoining the wrong cluster (e.g., in the
event of a network partition), each node binds to the first cluster it joins.

However, if a cluster has permanently failed (see the :ref:`previous section
<crate-node-unsafe-bootstrap>`) you must detach nodes before you can move them
to a new cluster.

The `crate-node`_ ``detach-cluster`` command can help you move a node to a new
The :ref:`detach-cluster command <crate-reference:cli-crate-node-commands>`
helps you move a node to a new
cluster by resetting the cluster it is bound to (i.e., *detaching* it from its
existing cluster).

@@ -278,8 +275,7 @@ existing cluster).
cluster bootstrap <crate-node-unsafe-bootstrap>`.


Procedure
---------
.. rubric:: Procedure

To detach a node, run:

@@ -293,7 +289,7 @@ To detach a node, run:

Confirm [y/N] y

You should see this:
A corresponding message confirms success.

.. code-block:: console

@@ -304,14 +300,16 @@ When the node is started again, it will be able to join a new cluster.
.. NOTE::

You may also have to update the :ref:`discovery configuration
<crate-reference:conf_discovery>` so that
<crate-reference:conf_discovery>`, so that
nodes are able to find the new cluster.


.. _crate-node: https://cratedb.com/docs/crate/reference/en/latest/cli-tools.html#cli-crate-node
.. _data path: https://cratedb.com/docs/crate/reference/en/latest/config/environment.html#application-variables
.. _network partition: https://en.wikipedia.org/wiki/Network_partition
.. _node.data: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#node-types
.. _node.master: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#node-types
.. _quorum: https://cratedb.com/docs/crate/reference/en/latest/concepts/clustering.html#master-node-election
.. _role: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#node-types
.. _split-brain: https://en.wikipedia.org/wiki/Split-brain_(computing)
.. _UUID: https://en.wikipedia.org/wiki/Universally_unique_identifier