Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit 0739d24d authored by David S. Miller's avatar David S. Miller
Browse files

Merge branch 'devlink-health'



Eran Ben Elisha says:

====================
Devlink health reporting and recovery system

The health mechanism is targeted for Real Time Alerting, in order to know when
something bad had happened to a PCI device
- Provide alert debug information
- Self healing
- If problem needs vendor support, provide a way to gather all needed debugging
  information.

The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The devlink health reporter:
Device driver creates a "health reporter" per each error/health type.
Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter a driver can issue error/health reports
asynchronously. All health reports handling is done by devlink.
Device driver can provide specific callbacks for each "health reporter", e.g.
 - Recovery procedures
 - Diagnostics and object dump procedures
 - OOB initial attributes
Different parts of the driver can register different types of health reporters
with different handlers.

Once an error is reported, devlink health will do the following actions:
  * A log is being send to the kernel trace events buffer
  * Health status and statistics are being updated for the reporter instance
  * Object dump is being taken and saved at the reporter instance (as long as
    there is no other dump which is already stored)
  * Auto recovery attempt is being done. Depends on:
    - Auto-recovery configuration
    - Grace period vs. time passed since last recover

The user interface:
User can access/change each reporter attributes and driver specific callbacks
via devlink, e.g per error type (per health reporter)
 - Configure reporter's generic attributes (like: Disable/enable auto recovery)
 - Invoke recovery procedure
 - Run diagnostics
 - Object dump

The devlink health interface (via netlink):
DEVLINK_CMD_HEALTH_REPORTER_GET
  Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET
  Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER
  Triggers a reporter's recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
  Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
  Retrieves the last stored dump. Devlink health
  saves a single dump. If an dump is not already stored by the devlink
  for this reporter, devlink generates a new dump.
  dump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
  Clears the last saved dump file for the specified reporter.

                                               netlink
                                      +--------------------------+
                                      |                          |
                                      |            +             |
                                      |            |             |
                                      +--------------------------+
                                                   |request for ops
                                                   |(diagnose,
 mlx5_core                             devlink     |recover,
                                                   |dump)
+--------+                            +--------------------------+
|        |                            |    reporter|             |
|        |                            |  +---------v----------+  |
|        |   ops execution            |  |                    |  |
|     <----------------------------------+                    |  |
|        |                            |  |                    |  |
|        |                            |  + ^------------------+  |
|        |                            |    | request for ops     |
|        |                            |    | (recover, dump)     |
|        |                            |    |                     |
|        |                            |  +-+------------------+  |
|        |     health report          |  | health handler     |  |
|        +------------------------------->                    |  |
|        |                            |  +--------------------+  |
|        |     health reporter create |                          |
|        +---------------------------->                          |
+--------+                            +--------------------------+

In this patchset, mlx5e TX reporter is implemented.

Cmdline format:
    devlink health show [DEV reporter REPORTE_NAME]
    devlink health recover DEV reporter REPORTER_NAME
    devlink health diagnose DEV reporter REPORTER_NAME
    devlink health dump show DEV reporter REPORTER_NAME
    devlink health dump clear DEV reporter REPORTER_NAME
    devlink health set DEV reporter REPORTER_NAME NAME VALUE

Cmdline examples:
$devlink health show
pci/0000:00:09.0:
  name tx
    state healthy #err 1 #recover 0 last_dump_ts N/A
    parameters:
      grace_period 500 auto_recover false

$devlink health diagnose pci/0000:00:09.0 reporter tx -j -p
{
    "SQs": [ {
            "sqn": 138,
            "HW state": 1,
            "stopped": false
        },{
            "sqn": 142,
            "HW state": 1,
            "stopped": false
        } ]
}

$devlink health diagnose pci/0000:00:09.0 reporter tx
SQs:
  sqn: 138 HW state: 1 stopped: false
  sqn: 142 HW state: 1 stopped: false

$devlink health recover pci/0000:00:09 reporter tx

$devlink health set pci/0000:00:09.0 reporter tx grace_period 3500

$devlink health set pci/0000:00:09.0 reporter tx auto_recover false

Changelog:
v4:
- Rebase on latest net-next
- Remove trace_devlink_health signature exposure in case CONFIG_NET_DEVLINK is
  not defined as it shall only be used from devlink.

v3:
- Redesign of devlink <-> driver fmsg API
- Various bug fixes

v2:
- Remove FW* reporters to decrease the amount of patches in the patchset
====================

Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
parents 8f289805 db2ab7a0
Loading
Loading
Loading
Loading
+86 −0
Original line number Diff line number Diff line
The health mechanism is targeted for Real Time Alerting, in order to know when
something bad had happened to a PCI device
- Provide alert debug information
- Self healing
- If problem needs vendor support, provide a way to gather all needed debugging
  information.

The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The devlink health reporter:
Device driver creates a "health reporter" per each error/health type.
Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter a driver can issue error/health reports
asynchronously. All health reports handling is done by devlink.
Device driver can provide specific callbacks for each "health reporter", e.g.
 - Recovery procedures
 - Diagnostics and object dump procedures
 - OOB initial parameters
Different parts of the driver can register different types of health reporters
with different handlers.

Once an error is reported, devlink health will do the following actions:
  * A log is being send to the kernel trace events buffer
  * Health status and statistics are being updated for the reporter instance
  * Object dump is being taken and saved at the reporter instance (as long as
    there is no other dump which is already stored)
  * Auto recovery attempt is being done. Depends on:
    - Auto-recovery configuration
    - Grace period vs. time passed since last recover

The user interface:
User can access/change each reporter's parameters and driver specific callbacks
via devlink, e.g per error type (per health reporter)
 - Configure reporter's generic parameters (like: disable/enable auto recovery)
 - Invoke recovery procedure
 - Run diagnostics
 - Object dump

The devlink health interface (via netlink):
DEVLINK_CMD_HEALTH_REPORTER_GET
  Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET
  Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER
  Triggers a reporter's recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
  Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
  Retrieves the last stored dump. Devlink health
  saves a single dump. If an dump is not already stored by the devlink
  for this reporter, devlink generates a new dump.
  dump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
  Clears the last saved dump file for the specified reporter.


                                               netlink
                                      +--------------------------+
                                      |                          |
                                      |            +             |
                                      |            |             |
                                      +--------------------------+
                                                   |request for ops
                                                   |(diagnose,
 mlx5_core                             devlink     |recover,
                                                   |dump)
+--------+                            +--------------------------+
|        |                            |    reporter|             |
|        |                            |  +---------v----------+  |
|        |   ops execution            |  |                    |  |
|     <----------------------------------+                    |  |
|        |                            |  |                    |  |
|        |                            |  + ^------------------+  |
|        |                            |    | request for ops     |
|        |                            |    | (recover, dump)     |
|        |                            |    |                     |
|        |                            |  +-+------------------+  |
|        |     health report          |  | health handler     |  |
|        +------------------------------->                    |  |
|        |                            |  +--------------------+  |
|        |     health reporter create |                          |
|        +---------------------------->                          |
+--------+                            +--------------------------+
+1 −1
Original line number Diff line number Diff line
@@ -22,7 +22,7 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
#
mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
		en_tx.o en_rx.o en_dim.o en_txrx.o en/xdp.o en_stats.o \
		en_selftest.o en/port.o en/monitor_stats.o
		en_selftest.o en/port.o en/monitor_stats.o en/reporter_tx.o

#
# Netdev extra
+14 −4
Original line number Diff line number Diff line
@@ -387,10 +387,7 @@ struct mlx5e_txqsq {
	struct mlx5e_channel      *channel;
	int                        txq_ix;
	u32                        rate_limit;
	struct mlx5e_txqsq_recover {
	struct work_struct         recover_work;
		u64                        last_recover;
	} recover;
} ____cacheline_aligned_in_smp;

struct mlx5e_dma_info {
@@ -683,6 +680,13 @@ struct mlx5e_rss_params {
	u8	hfunc;
};

struct mlx5e_modify_sq_param {
	int curr_state;
	int next_state;
	int rl_update;
	int rl_index;
};

struct mlx5e_priv {
	/* priv data path fields - start */
	struct mlx5e_txqsq *txq2sq[MLX5E_MAX_NUM_CHANNELS * MLX5E_MAX_NUM_TC];
@@ -738,6 +742,7 @@ struct mlx5e_priv {
#ifdef CONFIG_MLX5_EN_TLS
	struct mlx5e_tls          *tls;
#endif
	struct devlink_health_reporter *tx_reporter;
};

struct mlx5e_profile {
@@ -868,6 +873,11 @@ void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params);
void mlx5e_init_rq_type_params(struct mlx5_core_dev *mdev,
			       struct mlx5e_params *params);

int mlx5e_modify_sq(struct mlx5_core_dev *mdev, u32 sqn,
		    struct mlx5e_modify_sq_param *p);
void mlx5e_activate_txqsq(struct mlx5e_txqsq *sq);
void mlx5e_tx_disable_queue(struct netdev_queue *txq);

static inline bool mlx5e_tunnel_inner_ft_supported(struct mlx5_core_dev *mdev)
{
	return (MLX5_CAP_ETH(mdev, tunnel_stateless_gre) &&
+15 −0
Original line number Diff line number Diff line
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2019 Mellanox Technologies. */

#ifndef __MLX5E_EN_REPORTER_H
#define __MLX5E_EN_REPORTER_H

#include <linux/mlx5/driver.h>
#include "en.h"

int mlx5e_tx_reporter_create(struct mlx5e_priv *priv);
void mlx5e_tx_reporter_destroy(struct mlx5e_priv *priv);
void mlx5e_tx_reporter_err_cqe(struct mlx5e_txqsq *sq);
int mlx5e_tx_reporter_timeout(struct mlx5e_txqsq *sq);

#endif
+297 −0
Original line number Diff line number Diff line
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2019 Mellanox Technologies. */

#include <net/devlink.h>
#include "reporter.h"
#include "lib/eq.h"

#define MLX5E_TX_REPORTER_PER_SQ_MAX_LEN 256

struct mlx5e_tx_err_ctx {
	int (*recover)(struct mlx5e_txqsq *sq);
	struct mlx5e_txqsq *sq;
};

static int mlx5e_wait_for_sq_flush(struct mlx5e_txqsq *sq)
{
	unsigned long exp_time = jiffies + msecs_to_jiffies(2000);

	while (time_before(jiffies, exp_time)) {
		if (sq->cc == sq->pc)
			return 0;

		msleep(20);
	}

	netdev_err(sq->channel->netdev,
		   "Wait for SQ 0x%x flush timeout (sq cc = 0x%x, sq pc = 0x%x)\n",
		   sq->sqn, sq->cc, sq->pc);

	return -ETIMEDOUT;
}

static void mlx5e_reset_txqsq_cc_pc(struct mlx5e_txqsq *sq)
{
	WARN_ONCE(sq->cc != sq->pc,
		  "SQ 0x%x: cc (0x%x) != pc (0x%x)\n",
		  sq->sqn, sq->cc, sq->pc);
	sq->cc = 0;
	sq->dma_fifo_cc = 0;
	sq->pc = 0;
}

static int mlx5e_sq_to_ready(struct mlx5e_txqsq *sq, int curr_state)
{
	struct mlx5_core_dev *mdev = sq->channel->mdev;
	struct net_device *dev = sq->channel->netdev;
	struct mlx5e_modify_sq_param msp = {0};
	int err;

	msp.curr_state = curr_state;
	msp.next_state = MLX5_SQC_STATE_RST;

	err = mlx5e_modify_sq(mdev, sq->sqn, &msp);
	if (err) {
		netdev_err(dev, "Failed to move sq 0x%x to reset\n", sq->sqn);
		return err;
	}

	memset(&msp, 0, sizeof(msp));
	msp.curr_state = MLX5_SQC_STATE_RST;
	msp.next_state = MLX5_SQC_STATE_RDY;

	err = mlx5e_modify_sq(mdev, sq->sqn, &msp);
	if (err) {
		netdev_err(dev, "Failed to move sq 0x%x to ready\n", sq->sqn);
		return err;
	}

	return 0;
}

static int mlx5e_tx_reporter_err_cqe_recover(struct mlx5e_txqsq *sq)
{
	struct mlx5_core_dev *mdev = sq->channel->mdev;
	struct net_device *dev = sq->channel->netdev;
	u8 state;
	int err;

	if (!test_bit(MLX5E_SQ_STATE_RECOVERING, &sq->state))
		return 0;

	err = mlx5_core_query_sq_state(mdev, sq->sqn, &state);
	if (err) {
		netdev_err(dev, "Failed to query SQ 0x%x state. err = %d\n",
			   sq->sqn, err);
		return err;
	}

	if (state != MLX5_SQC_STATE_ERR) {
		netdev_err(dev, "SQ 0x%x not in ERROR state\n", sq->sqn);
		return -EINVAL;
	}

	mlx5e_tx_disable_queue(sq->txq);

	err = mlx5e_wait_for_sq_flush(sq);
	if (err)
		return err;

	/* At this point, no new packets will arrive from the stack as TXQ is
	 * marked with QUEUE_STATE_DRV_XOFF. In addition, NAPI cleared all
	 * pending WQEs. SQ can safely reset the SQ.
	 */

	err = mlx5e_sq_to_ready(sq, state);
	if (err)
		return err;

	mlx5e_reset_txqsq_cc_pc(sq);
	sq->stats->recover++;
	mlx5e_activate_txqsq(sq);

	return 0;
}

void mlx5e_tx_reporter_err_cqe(struct mlx5e_txqsq *sq)
{
	char err_str[MLX5E_TX_REPORTER_PER_SQ_MAX_LEN];
	struct mlx5e_tx_err_ctx err_ctx = {0};

	err_ctx.sq       = sq;
	err_ctx.recover  = mlx5e_tx_reporter_err_cqe_recover;
	sprintf(err_str, "ERR CQE on SQ: 0x%x", sq->sqn);

	devlink_health_report(sq->channel->priv->tx_reporter, err_str,
			      &err_ctx);
}

static int mlx5e_tx_reporter_timeout_recover(struct mlx5e_txqsq *sq)
{
	struct mlx5_eq_comp *eq = sq->cq.mcq.eq;
	u32 eqe_count;
	int ret;

	netdev_err(sq->channel->netdev, "EQ 0x%x: Cons = 0x%x, irqn = 0x%x\n",
		   eq->core.eqn, eq->core.cons_index, eq->core.irqn);

	eqe_count = mlx5_eq_poll_irq_disabled(eq);
	ret = eqe_count ? true : false;
	if (!eqe_count) {
		clear_bit(MLX5E_SQ_STATE_ENABLED, &sq->state);
		return ret;
	}

	netdev_err(sq->channel->netdev, "Recover %d eqes on EQ 0x%x\n",
		   eqe_count, eq->core.eqn);
	sq->channel->stats->eq_rearm++;
	return ret;
}

int mlx5e_tx_reporter_timeout(struct mlx5e_txqsq *sq)
{
	char err_str[MLX5E_TX_REPORTER_PER_SQ_MAX_LEN];
	struct mlx5e_tx_err_ctx err_ctx;

	err_ctx.sq       = sq;
	err_ctx.recover  = mlx5e_tx_reporter_timeout_recover;
	sprintf(err_str,
		"TX timeout on queue: %d, SQ: 0x%x, CQ: 0x%x, SQ Cons: 0x%x SQ Prod: 0x%x, usecs since last trans: %u\n",
		sq->channel->ix, sq->sqn, sq->cq.mcq.cqn, sq->cc, sq->pc,
		jiffies_to_usecs(jiffies - sq->txq->trans_start));

	return devlink_health_report(sq->channel->priv->tx_reporter, err_str,
				     &err_ctx);
}

/* state lock cannot be grabbed within this function.
 * It can cause a dead lock or a read-after-free.
 */
int mlx5e_tx_reporter_recover_from_ctx(struct mlx5e_tx_err_ctx *err_ctx)
{
	return err_ctx->recover(err_ctx->sq);
}

static int mlx5e_tx_reporter_recover_all(struct mlx5e_priv *priv)
{
	int err;

	rtnl_lock();
	mutex_lock(&priv->state_lock);
	mlx5e_close_locked(priv->netdev);
	err = mlx5e_open_locked(priv->netdev);
	mutex_unlock(&priv->state_lock);
	rtnl_unlock();

	return err;
}

static int mlx5e_tx_reporter_recover(struct devlink_health_reporter *reporter,
				     void *context)
{
	struct mlx5e_priv *priv = devlink_health_reporter_priv(reporter);
	struct mlx5e_tx_err_ctx *err_ctx = context;

	return err_ctx ? mlx5e_tx_reporter_recover_from_ctx(err_ctx) :
			 mlx5e_tx_reporter_recover_all(priv);
}

static int
mlx5e_tx_reporter_build_diagnose_output(struct devlink_fmsg *fmsg,
					u32 sqn, u8 state, bool stopped)
{
	int err;

	err = devlink_fmsg_obj_nest_start(fmsg);
	if (err)
		return err;

	err = devlink_fmsg_u32_pair_put(fmsg, "sqn", sqn);
	if (err)
		return err;

	err = devlink_fmsg_u8_pair_put(fmsg, "HW state", state);
	if (err)
		return err;

	err = devlink_fmsg_bool_pair_put(fmsg, "stopped", stopped);
	if (err)
		return err;

	err = devlink_fmsg_obj_nest_end(fmsg);
	if (err)
		return err;

	return 0;
}

static int mlx5e_tx_reporter_diagnose(struct devlink_health_reporter *reporter,
				      struct devlink_fmsg *fmsg)
{
	struct mlx5e_priv *priv = devlink_health_reporter_priv(reporter);
	int i, err = 0;

	mutex_lock(&priv->state_lock);

	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
		goto unlock;

	err = devlink_fmsg_arr_pair_nest_start(fmsg, "SQs");
	if (err)
		goto unlock;

	for (i = 0; i < priv->channels.num * priv->channels.params.num_tc;
	     i++) {
		struct mlx5e_txqsq *sq = priv->txq2sq[i];
		u8 state;

		err = mlx5_core_query_sq_state(priv->mdev, sq->sqn, &state);
		if (err)
			break;

		err = mlx5e_tx_reporter_build_diagnose_output(fmsg, sq->sqn,
							      state,
							      netif_xmit_stopped(sq->txq));
		if (err)
			break;
	}
	err = devlink_fmsg_arr_pair_nest_end(fmsg);
	if (err)
		goto unlock;

unlock:
	mutex_unlock(&priv->state_lock);
	return err;
}

static const struct devlink_health_reporter_ops mlx5_tx_reporter_ops = {
		.name = "tx",
		.recover = mlx5e_tx_reporter_recover,
		.diagnose = mlx5e_tx_reporter_diagnose,
};

#define MLX5_REPORTER_TX_GRACEFUL_PERIOD 500

int mlx5e_tx_reporter_create(struct mlx5e_priv *priv)
{
	struct mlx5_core_dev *mdev = priv->mdev;
	struct devlink *devlink = priv_to_devlink(mdev);

	priv->tx_reporter =
		devlink_health_reporter_create(devlink, &mlx5_tx_reporter_ops,
					       MLX5_REPORTER_TX_GRACEFUL_PERIOD,
					       true, priv);
	if (IS_ERR_OR_NULL(priv->tx_reporter))
		netdev_warn(priv->netdev,
			    "Failed to create tx reporter, err = %ld\n",
			    PTR_ERR(priv->tx_reporter));
	return PTR_ERR_OR_ZERO(priv->tx_reporter);
}

void mlx5e_tx_reporter_destroy(struct mlx5e_priv *priv)
{
	if (IS_ERR_OR_NULL(priv->tx_reporter))
		return;

	devlink_health_reporter_destroy(priv->tx_reporter);
}
Loading