Skip to content

Commit

Permalink
feat(system_monitor): check UDP network errors (#9538)
Browse files Browse the repository at this point in the history
* feat(system_monitor): generalize logic for /proc/net/snmp

Signed-off-by: takeshi.iwanari <[email protected]>

* feat(system_monitor): add UDP buf errors check

Signed-off-by: takeshi.iwanari <[email protected]>

* fix calculation for errors per unit time at the first time

Signed-off-by: takeshi.iwanari <[email protected]>

* style(pre-commit): autofix

* organize code

Signed-off-by: takeshi.iwanari <[email protected]>

* style(pre-commit): autofix

* fix warnings

Signed-off-by: takeshi.iwanari <[email protected]>

* remove unnecessary fmt::format

Signed-off-by: takeshi.iwanari <[email protected]>

* organize code for metrics from /proc/net/snmp

Signed-off-by: takeshi.iwanari <[email protected]>

* style(pre-commit): autofix

* separate ROS 2 parameters from constructor

Signed-off-by: takeshi.iwanari <[email protected]>

* suppress log

Signed-off-by: takeshi.iwanari <[email protected]>

* style(pre-commit): autofix

* fix bugprone-fold-init-type

Signed-off-by: takeshi.iwanari <[email protected]>

* dummy commit to kick workflows

Signed-off-by: takeshi.iwanari <[email protected]>

---------

Signed-off-by: takeshi.iwanari <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ito-san <[email protected]>
  • Loading branch information
3 people authored Dec 20, 2024
1 parent 160e47b commit e5243e4
Show file tree
Hide file tree
Showing 6 changed files with 333 additions and 140 deletions.
1 change: 1 addition & 0 deletions system/system_monitor/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,7 @@ Every topic is published in 1 minute interval.
| | Network Usage |||| Notification of usage only, normally error not generated. |
| | Network CRC Error |||| Warning occurs when the number of CRC errors in the period reaches the threshold value. The number of CRC errors that occur is the same as the value that can be confirmed with the ip command. |
| | IP Packet Reassembles Failed |||| |
| | UDP Buf Errors |||| |
| NTP Monitor | NTP Offset |||| |
| Process Monitor | Tasks Summary |||| |
| | High-load Proc[0-9] |||| |
Expand Down
2 changes: 2 additions & 0 deletions system/system_monitor/config/net_monitor.param.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -7,3 +7,5 @@
crc_error_count_threshold: 1
reassembles_failed_check_duration: 1
reassembles_failed_check_count: 1
udp_buf_errors_check_duration: 1
udp_buf_errors_check_count: 1
2 changes: 2 additions & 0 deletions system/system_monitor/docs/ros_parameters.md
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,8 @@ net_monitor:
| crc_error_count_threshold | int | n/a | 1 | Generates warning when count of CRC errors during CRC error check duration reaches a specified value or higher. |
| reassembles_failed_check_duration | int | sec | 1 | IP packet reassembles failed check duration. |
| reassembles_failed_check_count | int | n/a | 1 | Generates warning when count of IP packet reassembles failed during IP packet reassembles failed check duration reaches a specified value or higher. |
| udp_buf_errors_check_duration | int | sec | 1 | UDP buf errors check duration. |
| udp_buf_errors_check_count | int | n/a | 1 | Generates warning when count of UDP buf errors during udp_buf_errors_check_duration reaches a specified value or higher. |

## <u>NTP Monitor</u>

Expand Down
20 changes: 20 additions & 0 deletions system/system_monitor/docs/topics_net_monitor.md
Original file line number Diff line number Diff line change
Expand Up @@ -106,3 +106,23 @@
| --------------------------------------- | --------------- |
| total packet reassembles failed | 0 |
| packet reassembles failed per unit time | 0 |

## <u>UDP Buf Errors</u>

/diagnostics/net_monitor: UDP Buf Errors

<b>[summary]</b>

| level | message |
| ----- | -------------- |
| OK | OK |
| WARN | UDP buf errors |

<b>[values]</b>

| key | value (example) |
| -------------------------------- | --------------- |
| total UDP rcv buf errors | 0 |
| UDP rcv buf errors per unit time | 0 |
| total UDP snd buf errors | 0 |
| UDP snd buf errors per unit time | 0 |
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,93 @@ struct CrcErrors
unsigned int last_rx_crc_errors{0}; //!< @brief rx_crc_error at the time of the last monitoring
};

/**
* @brief /proc/net/snmp information
*/
class NetSnmp
{
public:
enum class Result {
OK,
CHECK_WARNING,
READ_ERROR,
};

/**
* @brief Constructor
* @param [in] node node using this class.
*/
explicit NetSnmp(rclcpp::Node * node);

/**
* @brief Constructor
*/
NetSnmp() = delete;

/**
* @brief Copy constructor
*/
NetSnmp(const NetSnmp &) = delete;

/**
* @brief Copy assignment operator
*/
NetSnmp & operator=(const NetSnmp &) = delete;

/**
* @brief Move constructor
*/
NetSnmp(const NetSnmp &&) = delete;

/**
* @brief Move assignment operator
*/
NetSnmp & operator=(const NetSnmp &&) = delete;

/**
* @brief Set parameters for check
* @param [in] check_duration the value for check_duration
* @param [in] check_count the value for check_count
*/
void set_check_parameters(unsigned int check_duration, unsigned int check_count);

/**
* @brief Find index in `/proc/net/snmp`
* @param [in] protocol Protocol name (the first column string). e.g. "Ip:" or "Udp:"
* @param [in] metrics Metrics name. e.g. "ReasmFails"
*/
void find_index(const std::string & protocol, const std::string & metrics);

/**
* @brief Check metrics
* @param [out] current_value the value read from snmp
* @param [out] value_per_unit_time the increase of the value during the duration
* @return the result of check
*/
Result check_metrics(uint64_t & current_value, uint64_t & value_per_unit_time);

private:
/**
* @brief Read value from `/proc/net/snmp`
* @param [in] index_row row in `/proc/net/snmp`
* @param [in] index_col col in `/proc/net/snmp`
* @param [out] output_value retrieved value
* @return execution result
*/
bool read_value_from_proc(
unsigned int index_row, unsigned int index_col, uint64_t & output_value);

rclcpp::Logger logger_; //!< @brief logger gotten from user node
unsigned int check_duration_; //!< @brief check duration
unsigned int check_count_; //!< @brief check count threshold
unsigned int index_row_; //!< @brief index for the target metrics in /proc/net/snmp
unsigned int index_col_; //!< @brief index for the target metrics in /proc/net/snmp
uint64_t current_value_; //!< @brief the value read from snmp
uint64_t last_value_; //!< @brief the value read from snmp at the last monitoring
uint64_t value_per_unit_time_; //!< @brief the increase of the value during the duration
std::deque<unsigned int> queue_; //!< @brief queue that holds the delta of the value
};

namespace local = boost::asio::local;

class NetMonitor : public rclcpp::Node
Expand Down Expand Up @@ -150,6 +237,12 @@ class NetMonitor : public rclcpp::Node
*/
void check_reassembles_failed(diagnostic_updater::DiagnosticStatusWrapper & status);

/**
* @brief Check UDP buf errors
* @param [out] status diagnostic message passed directly to diagnostic publish calls
*/
void check_udp_buf_errors(diagnostic_updater::DiagnosticStatusWrapper & status);

/**
* @brief Timer callback
*/
Expand Down Expand Up @@ -273,18 +366,6 @@ class NetMonitor : public rclcpp::Node
*/
void close_connection();

/**
* @brief Get column index of IP packet reassembles failed from `/proc/net/snmp`
*/
void get_reassembles_failed_column_index();

/**
* @brief get IP packet reassembles failed
* @param [out] reassembles_failed IP packet reassembles failed
* @return execution result
*/
bool get_reassembles_failed(uint64_t & reassembles_failed);

diagnostic_updater::Updater updater_; //!< @brief Updater class which advertises to /diagnostics
rclcpp::TimerBase::SharedPtr timer_; //!< @brief timer to get Network information

Expand All @@ -307,16 +388,9 @@ class NetMonitor : public rclcpp::Node
unsigned int crc_error_check_duration_; //!< @brief CRC error check duration
unsigned int crc_error_count_threshold_; //!< @brief CRC error count threshold

std::deque<unsigned int>
reassembles_failed_queue_; //!< @brief queue that holds count of IP packet reassembles failed
uint64_t last_reassembles_failed_; //!< @brief IP packet reassembles failed at the time of the
//!< last monitoring
unsigned int
reassembles_failed_check_duration_; //!< @brief IP packet reassembles failed check duration
unsigned int
reassembles_failed_check_count_; //!< @brief IP packet reassembles failed check count threshold
unsigned int reassembles_failed_column_index_; //!< @brief column index of IP Reassembles failed
//!< in /proc/net/snmp
NetSnmp reassembles_failed_info_; //!< @brief information of IP packet reassembles failed
NetSnmp udp_rcvbuf_errors_info_; //!< @brief information of UDP rcv buf errors
NetSnmp udp_sndbuf_errors_info_; //!< @brief information of UDP snd buf errors

/**
* @brief Network connection status messages
Expand Down
Loading

0 comments on commit e5243e4

Please sign in to comment.