refactor(node): observability refinements. #1085

Merged
merged 2 commits into from
Jul 23, 2024
62 changes: 27 additions & 35 deletions CHANGELOG.md
@@ -2,46 +2,38 @@

All notable changes to this project will be documented in this file.

### Observability Framework
## Unreleased

- Introduced a new observability framework utilizing the `ipc-observability` crate.
### Experimental: Observability Framework 👁️📊

- Introduced a new observability framework utilizing the `ipc-observability` crate.
- This framework introduces events and metrics for detailed system monitoring and analysis.
- Seamlessly integrates with Prometheus for real-time tracking, alerting, and visualization.
- Integrates with Prometheus for real-time tracking, alerting, and visualization.
- Simplifies observability integration with ready-to-use macros, structs, and functions.

## Added traces and metrics

- Introduced the `BlockProposalReceived` event, which tracks block proposal reception.
- Metric: `proposals_block_proposal_received` (CounterVec)
- Added the `BlockProposalSent` event, which tracks block proposal sending.
- Metric: `proposals_block_proposal_sent` (CounterVec)
- Implemented the `BlockProposalEvaluated` event, which evaluates block proposals.
- Metrics: `proposals_block_proposal_accepted` (CounterVec), `proposals_block_proposal_rejected` (CounterVec)
- Created the `BlockCommitted` event for tracking committed blocks.
- Metric: `proposals_block_committed` (CounterVec)
- Added the `MsgExec` event to represent various execution purposes such as Check, Apply, Estimate, and Call.
- Metrics: `exec_fvm_check_execution_time_secs` (Histogram), `exec_fvm_estimate_execution_time_secs` (Histogram), `exec_fvm_apply_execution_time_secs` (Histogram), `exec_fvm_call_execution_time_secs` (Histogram)
- Introduced the `CheckpointCreated` event for creating bottom-up checkpoints.
- Metrics: `bottomup_checkpoint_created_total` (IntCounter), `bottomup_checkpoint_created_height` (IntGauge), `bottomup_checkpoint_created_msgcount` (IntGauge), `bottomup_checkpoint_created_confignum` (IntGauge)
- Implemented the `CheckpointSigned` event for signing bottom-up checkpoints.
- Metric: `bottomup_checkpoint_signed_height` (IntGaugeVec)
- Added the `CheckpointFinalized` event for finalizing bottom-up checkpoints.
- Metric: `bottomup_checkpoint_finalized_height` (IntGauge)
- Created the `ParentRpcCalled` event to track parent RPC calls.
- Metrics: `topdown_parent_rpc_call_total` (IntCounterVec), `topdown_parent_rpc_call_latency_secs` (HistogramVec)
- Added the `ParentFinalityAcquired` event for acquiring parent finality.
- Metric: `topdown_parent_finality_latest_acquired_height` (IntGaugeVec)
- Implemented the `ParentFinalityPeerVoteReceived` event for receiving parent finality peer votes.
- Metric: `topdown_parent_finality_voting_latest_received_height` (IntGaugeVec)
- Created the `ParentFinalityPeerVoteSent` event for sending parent finality peer votes.
- Metric: `topdown_parent_finality_voting_latest_sent_height` (IntGauge)
- Introduced the `ParentFinalityPeerQuorumReached` event to signify quorum reach in parent finality.
- Metrics: `topdown_parent_finality_voting_quorum_height` (IntGauge), `topdown_parent_finality_voting_quorum_weight` (IntGauge)
- Added the `ParentFinalityCommitted` event to track committed parent finality.
- Metric: `topdown_parent_finality_committed_height` (IntGauge)
- Implemented the `TracingError` event to log tracing errors.
- Metric: `tracing_errors` (IntCounterVec)
IPC now emits events during execution. These events are recorded in the Journal, and are transformed into Prometheus metrics. Observability configuration is performed via `config.toml`.

Refer to the full observability documentation [here](./docs/fendermint/observability.md).

### New events and metrics

| Domain | Event | Description | Metric(s) derived |
|:----------|-----------------------------------|---------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Consensus | `BlockProposalReceived` | Tracks block proposal reception | `consensus_block_proposal_received_height` (IntGauge) |
| Consensus | `BlockProposalSent` | Tracks block proposal sending | `consensus_block_proposal_sent_height` (IntGauge) |
| Consensus | `BlockProposalEvaluated` | Records the result from evaluating block proposals | `consensus_block_proposal_accepted_height` (IntGauge), `consensus_block_proposal_rejected_height` (IntGauge) |
| Consensus | `BlockCommitted` | Tracks committed blocks | `consensus_block_committed_height` (IntGauge) |
| Execution | `MsgExec` | Represents various message execution paths (Check, Apply, Estimate, Call) | `exec_fvm_check_execution_time_secs` (Histogram), `exec_fvm_estimate_execution_time_secs` (Histogram), `exec_fvm_apply_execution_time_secs` (Histogram), `exec_fvm_call_execution_time_secs` (Histogram) |
| Bottomup | `CheckpointCreated` | Records checkpoint creation | `bottomup_checkpoint_created_total` (IntCounter), `bottomup_checkpoint_created_height` (IntGauge), `bottomup_checkpoint_created_msgcount` (IntGauge), `bottomup_checkpoint_created_confignum` (IntGauge) |
| Bottomup | `CheckpointSigned` | Records checkpoint signatures | `bottomup_checkpoint_signed_height` (IntGaugeVec) |
| Bottomup | `CheckpointFinalized` | Records checkpoint finalization (quorum reached) | `bottomup_checkpoint_finalized_height` (IntGauge) |
| Topdown | `ParentRpcCalled` | Tracks parent RPC calls in the context of top-down finality | `topdown_parent_rpc_call_total` (IntCounterVec), `topdown_parent_rpc_call_latency_secs` (HistogramVec) |
| Topdown | `ParentFinalityAcquired` | Records acquisition of new parent finality | `topdown_parent_finality_latest_acquired_height` (IntGaugeVec) |
| Topdown | `ParentFinalityPeerVoteReceived` | Records peer votes for parent finality | `topdown_parent_finality_voting_latest_received_height` (IntGaugeVec) |
| Topdown | `ParentFinalityPeerVoteSent` | Records own votes for parent finality | `topdown_parent_finality_voting_latest_sent_height` (IntGauge) |
| Topdown | `ParentFinalityPeerQuorumReached` | Records quorum reached in parent finality voting | `topdown_parent_finality_voting_quorum_height` (IntGauge), `topdown_parent_finality_voting_quorum_weight` (IntGauge) |
| Topdown | `ParentFinalityCommitted` | Tracks parent finality committed on chain | `topdown_parent_finality_committed_height` (IntGauge) |
| System | `TracingError` | Logs tracing errors | `tracing_errors` (IntCounterVec) |

## [axon-r01] - 2024-07-15

79 changes: 44 additions & 35 deletions docs/fendermint/observability.md
@@ -1,35 +1,36 @@
# Observability Framework Documentation
# Observability

## Overview

The observability framework operates by introducing events and metrics that allow for detailed monitoring and analysis of system behavior. This is achieved through the use of the `ipc-observability` crate/library, which provides all the necessary helpers and tools to facilitate this process.
IPC's observability framework emits events throughout execution; these are recorded in a journal and transformed into Prometheus metrics.
This enables detailed monitoring and analysis of system behavior.
The `ipc-observability` crate provides the helpers and tools that implement this.

### How It Works
### How it works

1. **Events**: Specific events are defined and triggered throughout the codebase to capture significant occurrences or actions. These events encapsulate relevant data and context about what is happening within the system.
1. **Events**: Specific events are defined and triggered throughout the codebase to capture significant occurrences or actions.
These events encapsulate relevant data and context about what is happening within the system.

2. **Metrics**: Each event is associated with one or more Prometheus metrics. When an event is triggered, the corresponding metrics are updated to reflect the event's occurrence. This allows for real-time tracking and monitoring of various system activities and states.
2. **Journal**: Events are recorded in a journal, a rotating ledger that writes chronologically ordered, timestamped trace objects to log files on disk.
The journal can also be emitted to the console.

3. **Prometheus Integration**: The metrics collected are designed to integrate seamlessly with Prometheus, a powerful monitoring and alerting toolkit. Prometheus collects and stores these metrics, enabling detailed analysis and visualization through its query language and dashboarding capabilities.
3. **Metrics**: Each event is associated with one or more Prometheus metrics.
When an event is triggered, the corresponding metrics are updated to reflect the event's occurrence.
This allows for real-time tracking and monitoring of various system activities and states through dashboards and alerts.

4. **ipc-observability Crate**: This custom library encapsulates the logic and functionality required to define, trigger, and record events and metrics. It simplifies the process of adding observability to the codebase by providing ready-to-use macros, structs, and functions.
4. **Prometheus integration**: The metrics collected are designed to integrate seamlessly with Prometheus, a powerful monitoring and alerting toolkit.
Prometheus collects and stores these metrics, enabling detailed analysis and visualization through its query language and dashboarding capabilities.

### Benefits

- **Real-time Monitoring**: Enables immediate visibility into the system's performance and behavior.
- **Detailed Analysis**: Facilitates in-depth analysis of trends, anomalies, and issues.
- **Alerting**: Allows for the setup of alerts based on specific metric thresholds, ensuring timely responses to potential problems.
- **Ease of Use**: The `ipc-observability` crate simplifies the integration of observability features, reducing the effort required to instrument the code.

By leveraging this observability framework, developers can gain valuable insights into their systems, leading to improved reliability, performance, and ov
5. **ipc-observability crate**: This custom library encapsulates the logic and functionality required to define, trigger, and record events and metrics.
It simplifies the process of adding observability to the codebase by providing ready-to-use macros, structs, and functions.
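
The flow described in the list above (event, then journal entry, then metric update) can be sketched roughly as follows. This is a minimal, std-only illustration and not the `ipc-observability` API: the names `Event`, `Journal`, `Gauges`, `Observer`, and `emit` are hypothetical, and a plain `HashMap` stands in for real Prometheus gauges. It also assumes that a `*_height` gauge is set to the height carried by the event.

```rust
use std::collections::HashMap;
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical event type; the real events carry more context.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    BlockProposalReceived { height: i64 },
    BlockCommitted { height: i64 },
}

/// Stand-in for the journal: timestamped entries in emission order.
#[derive(Default)]
pub struct Journal {
    /// (milliseconds since the Unix epoch, rendered event)
    pub entries: Vec<(u128, String)>,
}

/// Stand-in for Prometheus gauges, keyed by metric name.
#[derive(Default)]
pub struct Gauges {
    pub values: HashMap<&'static str, i64>,
}

/// Hypothetical glue that records an event and updates its metrics.
#[derive(Default)]
pub struct Observer {
    pub journal: Journal,
    pub gauges: Gauges,
}

impl Observer {
    /// Record the event in the journal, then update the metric(s) derived from it.
    pub fn emit(&mut self, event: &Event) {
        let ts = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_millis())
            .unwrap_or(0);
        self.journal.entries.push((ts, format!("{:?}", event)));

        // Assumption: each height gauge is set to the height in the event.
        match event {
            Event::BlockProposalReceived { height } => {
                self.gauges
                    .values
                    .insert("consensus_block_proposal_received_height", *height);
            }
            Event::BlockCommitted { height } => {
                self.gauges
                    .values
                    .insert("consensus_block_committed_height", *height);
            }
        }
    }
}
```

Calling `emit` with `Event::BlockCommitted { height: 10 }` on a default `Observer` appends one journal entry and leaves the `consensus_block_committed_height` entry at 10. Keeping the journal write and the metric update behind a single call is one way to ensure the two views cannot drift apart.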

## Metrics

- `proposals_block_proposal_received` (CounterVec): Incremented when a block proposal is received.
- `proposals_block_proposal_sent` (CounterVec): Incremented when a block proposal is sent.
- `proposals_block_proposal_accepted` (CounterVec): Incremented if the block proposal is accepted.
- `proposals_block_proposal_rejected` (CounterVec): Incremented if the block proposal is rejected.
- `proposals_block_committed` (CounterVec): Incremented when a block is committed.
- `consensus_block_proposal_received_height` (IntGauge): Set to the height of the block proposal when one is received.
- `consensus_block_proposal_sent_height` (IntGauge): Set to the height of the block proposal when one is sent.
- `consensus_block_proposal_accepted_height` (IntGauge): Set to the height of the block proposal if it is accepted.
- `consensus_block_proposal_rejected_height` (IntGauge): Set to the height of the block proposal if it is rejected.
- `consensus_block_committed_height` (IntGauge): Set to the height of the block when it is committed.
- `exec_fvm_check_execution_time_secs` (Histogram): Records the execution time of FVM check in seconds.
- `exec_fvm_estimate_execution_time_secs` (Histogram): Records the execution time of FVM estimate in seconds.
- `exec_fvm_apply_execution_time_secs` (Histogram): Records the execution time of FVM apply in seconds.
@@ -50,7 +51,7 @@ By leveraging this observability framework, developers can gain valuable insight
- `topdown_parent_finality_committed_height` (IntGauge): Sets the height of the committed parent finality.
- `tracing_errors` (IntCounterVec): Increments the count of tracing errors for the affected event.
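
The metric types in this list behave differently when updated, which matters when reading the descriptions above. The following std-only sketch illustrates the general Prometheus semantics of a gauge, a counter, and a histogram. The structs are hypothetical stand-ins written for illustration, not the types of any Prometheus client library, and the bucket bounds are arbitrary.

```rust
/// Gauge: holds the latest value and may go up or down (e.g. a `*_height` metric).
#[derive(Default)]
pub struct Gauge {
    pub value: i64,
}

impl Gauge {
    pub fn set(&mut self, v: i64) {
        self.value = v;
    }
}

/// Counter: only ever increases (e.g. a `*_total` metric).
#[derive(Default)]
pub struct Counter {
    pub value: u64,
}

impl Counter {
    pub fn inc(&mut self) {
        self.value += 1;
    }
}

/// Histogram: counts observations in cumulative buckets, as in the
/// Prometheus exposition format (e.g. a `*_time_secs` metric).
pub struct Histogram {
    pub upper_bounds: Vec<f64>,
    pub bucket_counts: Vec<u64>,
    pub count: u64,
    pub sum: f64,
}

impl Histogram {
    pub fn new(upper_bounds: Vec<f64>) -> Self {
        let n = upper_bounds.len();
        Histogram {
            upper_bounds,
            bucket_counts: vec![0; n],
            count: 0,
            sum: 0.0,
        }
    }

    /// Record one observation: every bucket whose upper bound is at least
    /// `v` is incremented, so the buckets are cumulative.
    pub fn observe(&mut self, v: f64) {
        for (bound, bucket) in self.upper_bounds.iter().zip(self.bucket_counts.iter_mut()) {
            if v <= *bound {
                *bucket += 1;
            }
        }
        self.count += 1;
        self.sum += v;
    }
}
```

A gauge that is set to 120 and then to 118 reads 118, whereas a counter incremented twice reads 2 and can never decrease. For the histogram, the total `count` plays the role of the implicit `+Inf` bucket.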

## Events and Corresponding Metrics
## Events and corresponding metrics

### BlockProposalReceived

@@ -67,7 +68,7 @@ Represents a block proposal received event.

**Affects metrics:**

- `proposals_block_proposal_received`
- `consensus_block_proposal_received_height`

### BlockProposalSent

@@ -83,7 +84,7 @@ Represents a block proposal sent event.

**Affects metrics:**

- `proposals_block_proposal_sent`
- `consensus_block_proposal_sent_height`

### BlockProposalEvaluated

@@ -102,8 +103,8 @@ Represents the evaluation of a block proposal.

**Affects metrics:**

- `proposals_block_proposal_accepted`
- `proposals_block_proposal_rejected`
- `consensus_block_proposal_accepted_height`
- `consensus_block_proposal_rejected_height`

### BlockCommitted

@@ -117,7 +118,7 @@ Represents a block committed event.

**Affects metrics:**

- `proposals_block_committed`
- `consensus_block_committed_height`

### MsgExec

@@ -305,29 +306,35 @@ Represents an error that occurs during tracing.

## Configuration

### Metrics Configuration
### Metrics configuration

The metrics can be configured via the configuration file for `Fendermint`. You can enable metrics and specify the listening host and port as follows:
The metrics can be configured via the `config.toml` configuration file for Fendermint. You can enable metrics and specify the listening host and port as follows:

````toml
```toml
[metrics]
enabled = true

[metrics.listen]
host = "127.0.0.1"
port = 9184
```

For Ethereum metrics, you can configure them similarly:

```toml
[eth.metrics]
enabled = true
````
```

## Tracing and journal configuration

## Tracing Configuration
> 🚧 Note: the event journal and general logs are currently written to the same file.
> We plan to separate them in the near future so that the event journal has its own dedicated file.
> See this issue: https://github.com/consensus-shipyard/ipc/issues/1084.

Tracing can also be configured via the configuration file for `Fendermint`. You can set the tracing level and specify whether to log to console or file.
Tracing can also be configured via the configuration file for Fendermint. You can set the tracing level and specify whether to log to console or file.

### Console Tracing
### Console tracing

Example config:

@@ -338,7 +345,7 @@ Example config:
level = "trace" # Options: off, error, warn, info, debug, trace (default: trace)
```

### File Tracing
### File tracing

Example config:

@@ -349,7 +356,9 @@ level = "trace" # Options: off, error, warn, info, debug, trace (default: trace)
directory = "/path/to/log/directory"
max_log_files = 5 # Number of files to keep after rotation
rotation = "daily" # Options: minutely, hourly, daily, never
domain_filter = "Bottomup, Proposals, Mpool, Execution, Topdown, TracingError"
## Optional: filter events by domain
domain_filter = "Bottomup, Consensus, Mpool, Execution, Topdown, TracingError"
## Optional: filter events by event name
events_filter = "ParentFinalityAcquired, ParentRpcCalled"
```
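
To make the two filter options more concrete, here is a rough sketch of how comma-separated filter values like the ones above could be parsed and applied. It is not Fendermint's implementation: `parse_filter` and `passes` are hypothetical helpers, and the matching rules are assumptions (an absent filter lets everything through, and when both filters are set an event must match both).

```rust
use std::collections::HashSet;

/// Parse a comma-separated filter value such as "Bottomup, Consensus"
/// into a set of trimmed, non-empty names.
pub fn parse_filter(raw: &str) -> HashSet<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

/// Decide whether an event should be written to the journal.
/// Assumed semantics: `None` means "no filtering" for that dimension,
/// and when both filters are present the event must satisfy both.
pub fn passes(
    domain: &str,
    event: &str,
    domain_filter: Option<&HashSet<String>>,
    events_filter: Option<&HashSet<String>>,
) -> bool {
    let domain_ok = domain_filter.map_or(true, |f| f.contains(domain));
    let event_ok = events_filter.map_or(true, |f| f.contains(event));
    domain_ok && event_ok
}
```

Under these assumptions, with `domain_filter` containing `Topdown` and `events_filter` set to `ParentFinalityAcquired, ParentRpcCalled`, a `ParentRpcCalled` event passes while a `ParentFinalityCommitted` event is dropped. Note that matching here is exact and case-sensitive, so a misspelled domain name in the configuration would silently match nothing.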
