Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit 22b35488 authored by David S. Miller's avatar David S. Miller
Browse files

Merge branch 'xdp'

Brenden Blanco says:

====================
Add driver bpf hook for early packet drop and forwarding

This patch set introduces new infrastructure for programmatically
processing packets in the earliest stages of rx, as part of an effort
others are calling eXpress Data Path (XDP) [1]. Start this effort by
introducing a new bpf program type for early packet filtering, before
even an skb has been allocated.

Extend on this with the ability to modify packet data and send back out
on the same port.

Patch 1 adds an API for bulk bpf prog refcnt incrememnt.
Patch 2 introduces the new prog type and helpers for validating the bpf
  program. A new userspace struct is defined containing only data and
  data_end as fields, with others to follow in the future.
In patch 3, create a new ndo to pass the fd to supported drivers.
In patch 4, expose a new rtnl option to userspace.
In patch 5, enable support in mlx4 driver.
In patch 6, create a sample drop and count program. With single core,
  achieved ~20 Mpps drop rate on a 40G ConnectX3-Pro. This includes
  packet data access, bpf array lookup, and increment.
In patch 7, add a page recycle facility to mlx4 rx, enabled when xdp is
  active.
In patch 8, add the XDP_TX type to bpf.h
In patch 9, add helper in tx patch for writing tx_desc
In patch 10, add support in mlx4 for packet data write and forwarding
In patch 11, turn on packet write support in the bpf verifier
In patch 12, add a sample program for packet write and forwarding. With
  single core, achieved ~10 Mpps rewrite and forwarding.

[1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf



v10:
 1/12: Add bulk refcnt api.
 5/12: Move prog from priv to ring. This attribute is still only set
   globally, but the path to finer granularity should be clear. No lock
   is taken, so some rings may operate on older programs for a time (one
   napi loop). Looked into options such as napi_synchronize, but they
   were deemed too slow (calls to msleep).
   Rename prog to xdp_prog. Add xdp_ring_num to help with accounting,
   used more heavily in later patches.
 7/12: Adjust to use per-ring xdp prog. Use priv->xdp_ring_num where
   before priv->prog was used to determine buffer allocations.
 9/12: Add cpu_to_be16 to vlan_tag in mxl4_en_xmit(). Remove unused variable
   from mlx4_en_xmit and unused params from build_inline_wqe.

v9:
 4/11: Add missing newline in en_err message.
 6/11: Move page_cache cleanup from mlx4_en_destroy_rx_ring to
   mlx4_en_deactivate_rx_ring. Move mlx4_en_moderation_update back to
   static. Remove calls to mlx4_en_alloc/free_resources in mlx4_xdp_set.
   Adopt instead the approach of mlx4_en_change_mtu to use a watchdog.
 9/11: Use a per-ring function pointer in tx to separate out the code
   for regular and recycle paths of tx completion handling. Add a helper
   function to init the recycle ring and callback, called just after
   activating tx. Remove extra tx ring resource requirement, and instead
   steal from the upper rings. This helps to avoid needing
   mlx4_en_alloc_resources. Add some hopefully meaningful error
   messages for the various error cases. Reverted some of the
   hard-to-follow logic that was accounting for the extra tx rings.

v8:
 1/11: Reduce WARN_ONCE to single line. Also, change act param of that
   function to u32 to match return type of bpf_prog_run_xdp.
 2/11: Clarify locking semantics in ndo comment.
 4/11: Add en_err warning in mlx4_xdp_set on num_frags/mtu violation.

v7:
 Addressing two of the major discussion points: return codes and ndo.
 The rest will be taken as todo items for separate patches.

 Add an XDP_ABORTED type, which explicitly falls through to DROP. The
 same result must be taken for the default case as well, as it is now
 well-defined API behavior.

 Merge ndo_xdp_* into a single ndo. The style is similar to
 ndo_setup_tc, but with less unidirectional naming convention. The IFLA
 parameter names are unchanged.

 TODOs:
 Add ethtool per-ring stats for aborted, default cases, maybe even drop
 and tx as well.
 Avoid duplicate dma sync operation in XDP_PASS case as mentioned by
 Saeed.

  1/12: Add XDP_ABORTED enum, reword API comment, and update commit
   message.
  2/12: Rewrite ndo_xdp_*() into single ndo_xdp() with type/union style
    calling convention.
  3/12: Switch to ndo_xdp callback.
  4/12: Add XDP_ABORTED case as a fall-through to XDP_DROP. Implement
    ndo_xdp.
 12/12: Dropped, this will need some more work.

v6:
  2/12: drop unnecessary netif_device_present check
  4/12, 6/12, 9/12: Reorder default case statement above drop case to
    remove some copy/paste.

v5:
  0/12: Rebase and remove previous 1/13 patch
  1/12: Fix nits from Daniel. Left the (void *) cast as-is, to be fixed
    in future. Add bpf_warn_invalid_xdp_action() helper, to be used when
    out of bounds action is returned by the program. Add a comment to
    bpf.h denoting the undefined nature of out of bounds returns.
  2/12: Switch to using bpf_prog_get_type(). Rename ndo_xdp_get() to
    ndo_xdp_attached().
  3/12: Add IFLA_XDP as a nested type, and add the associated nla_policy
    for the new subtypes IFLA_XDP_FD and IFLA_XDP_ATTACHED.
  4/12: Fixup the use of READ_ONCE in the ndos. Add a user of
    bpf_warn_invalid_xdp_action helper.
  5/12: Adjust to using the nested netlink options.
  6/12: kbuild was complaining about overflow of u16 on tile
    architecture...bump frag_stride to u32. The page_offset member that
    is computed from this was already u32.

v4:
  2/12: Add inline helper for calling xdp bpf prog under rcu
  3/12: Add detail to ndo comments
  5/12: Remove mlx4_call_xdp and use inline helper instead.
  6/12: Fix checkpatch complaints
  9/12: Introduce new patch 9/12 with common helper for tx_desc write
    Refactor to use common tx_desc write helper
 11/12: Fix checkpatch complaints

v3:
  Rewrite from v2 trying to incorporate feedback from multiple sources.
  Specifically, add ability to forward packets out the same port and
    allow packet modification.
  For packet forwarding, the driver reserves a dedicated set of tx rings
    for exclusive use by xdp. Upon completion, the pages on this ring are
    recycled directly back to a small per-rx-ring page cache without
    being dma unmapped.
  Use of the percpu skb is dropped in favor of a lightweight struct
    xdp_buff. The direct packet access feature is leveraged to remove
    dependence on the skb.
  The mlx4 driver implementation allocates a page-per-packet and maps it
    in PCI_DMA_BIDIRECTIONAL mode when the bpf program is activated.
  Naming is converted to use "xdp" instead of "phys_dev".

v2:
  1/5: Drop xdp from types, instead consistently use bpf_phys_dev_
    Introduce enum for return values from phys_dev hook
  2/5: Move prog->type check to just before invoking ndo
    Change ndo to take a bpf_prog * instead of fd
    Add ndo_bpf_get rather than keeping a bool in the netdev struct
  3/5: Use ndo_bpf_get to fetch bool
  4/5: Enforce that only 1 frag is ever given to bpf prog by disallowing
    mtu to increase beyond FRAG_SZ0 when bpf prog is running, or conversely
    to set a bpf prog when priv->num_frags > 1
    Rename pseudo_skb to bpf_phys_dev_md
    Implement ndo_bpf_get
    Add dma sync just before invoking prog
    Check for explicit bpf return code rather than nonzero
    Remove increment of rx_dropped
  5/5: Use explicit bpf return code in example
    Update commit log with higher pps numbers
====================

Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
parents ddbcb794 764cbcce
Loading
Loading
Loading
Loading
+6 −5
Original line number Diff line number Diff line
@@ -232,7 +232,7 @@ static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size)
		}
	} else {
		ctrl = buf = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1));
		s = (ctrl->fence_size & 0x3f) << 4;
		s = (ctrl->qpn_vlan.fence_size & 0x3f) << 4;
		for (i = 64; i < s; i += 64) {
			wqe = buf + i;
			*wqe = cpu_to_be32(0xffffffff);
@@ -264,7 +264,7 @@ static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size)
		inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl));
	}
	ctrl->srcrb_flags = 0;
	ctrl->fence_size = size / 16;
	ctrl->qpn_vlan.fence_size = size / 16;
	/*
	 * Make sure descriptor is fully written before setting ownership bit
	 * (because HW can start executing as soon as we do).
@@ -1992,7 +1992,8 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
			ctrl = get_send_wqe(qp, i);
			ctrl->owner_opcode = cpu_to_be32(1 << 31);
			if (qp->sq_max_wqes_per_wr == 1)
				ctrl->fence_size = 1 << (qp->sq.wqe_shift - 4);
				ctrl->qpn_vlan.fence_size =
						1 << (qp->sq.wqe_shift - 4);

			stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift);
		}
@@ -3169,7 +3170,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
		wmb();
		*lso_wqe = lso_hdr_sz;

		ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ?
		ctrl->qpn_vlan.fence_size = (wr->send_flags & IB_SEND_FENCE ?
					     MLX4_WQE_CTRL_FENCE : 0) | size;

		/*
+8 −1
Original line number Diff line number Diff line
@@ -1722,6 +1722,12 @@ static int mlx4_en_set_channels(struct net_device *dev,
	    !channel->tx_count || !channel->rx_count)
		return -EINVAL;

	if (channel->tx_count * MLX4_EN_NUM_UP <= priv->xdp_ring_num) {
		en_err(priv, "Minimum %d tx channels required with XDP on\n",
		       priv->xdp_ring_num / MLX4_EN_NUM_UP + 1);
		return -EINVAL;
	}

	mutex_lock(&mdev->state_lock);
	if (priv->port_up) {
		port_up = 1;
@@ -1740,7 +1746,8 @@ static int mlx4_en_set_channels(struct net_device *dev,
		goto out;
	}

	netif_set_real_num_tx_queues(dev, priv->tx_ring_num);
	netif_set_real_num_tx_queues(dev, priv->tx_ring_num -
							priv->xdp_ring_num);
	netif_set_real_num_rx_queues(dev, priv->rx_ring_num);

	if (dev->num_tc)
+125 −0
Original line number Diff line number Diff line
@@ -31,6 +31,7 @@
 *
 */

#include <linux/bpf.h>
#include <linux/etherdevice.h>
#include <linux/tcp.h>
#include <linux/if_vlan.h>
@@ -1521,6 +1522,24 @@ static void mlx4_en_free_affinity_hint(struct mlx4_en_priv *priv, int ring_idx)
	free_cpumask_var(priv->rx_ring[ring_idx]->affinity_mask);
}

static void mlx4_en_init_recycle_ring(struct mlx4_en_priv *priv,
				      int tx_ring_idx)
{
	struct mlx4_en_tx_ring *tx_ring = priv->tx_ring[tx_ring_idx];
	int rr_index;

	rr_index = (priv->xdp_ring_num - priv->tx_ring_num) + tx_ring_idx;
	if (rr_index >= 0) {
		tx_ring->free_tx_desc = mlx4_en_recycle_tx_desc;
		tx_ring->recycle_ring = priv->rx_ring[rr_index];
		en_dbg(DRV, priv,
		       "Set tx_ring[%d]->recycle_ring = rx_ring[%d]\n",
		       tx_ring_idx, rr_index);
	} else {
		tx_ring->recycle_ring = NULL;
	}
}

int mlx4_en_start_port(struct net_device *dev)
{
	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -1643,6 +1662,8 @@ int mlx4_en_start_port(struct net_device *dev)
		}
		tx_ring->tx_queue = netdev_get_tx_queue(dev, i);

		mlx4_en_init_recycle_ring(priv, i);

		/* Arm CQ for TX completions */
		mlx4_en_arm_cq(priv, cq);

@@ -2112,6 +2133,11 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
		en_err(priv, "Bad MTU size:%d.\n", new_mtu);
		return -EPERM;
	}
	if (priv->xdp_ring_num && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) {
		en_err(priv, "MTU size:%d requires frags but XDP running\n",
		       new_mtu);
		return -EOPNOTSUPP;
	}
	dev->mtu = new_mtu;

	if (netif_running(dev)) {
@@ -2520,6 +2546,103 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
	return err;
}

static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
{
	struct mlx4_en_priv *priv = netdev_priv(dev);
	struct mlx4_en_dev *mdev = priv->mdev;
	struct bpf_prog *old_prog;
	int xdp_ring_num;
	int port_up = 0;
	int err;
	int i;

	xdp_ring_num = prog ? ALIGN(priv->rx_ring_num, MLX4_EN_NUM_UP) : 0;

	/* No need to reconfigure buffers when simply swapping the
	 * program for a new one.
	 */
	if (priv->xdp_ring_num == xdp_ring_num) {
		if (prog) {
			prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
			if (IS_ERR(prog))
				return PTR_ERR(prog);
		}
		for (i = 0; i < priv->rx_ring_num; i++) {
			/* This xchg is paired with READ_ONCE in the fastpath */
			old_prog = xchg(&priv->rx_ring[i]->xdp_prog, prog);
			if (old_prog)
				bpf_prog_put(old_prog);
		}
		return 0;
	}

	if (priv->num_frags > 1) {
		en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
		return -EOPNOTSUPP;
	}

	if (priv->tx_ring_num < xdp_ring_num + MLX4_EN_NUM_UP) {
		en_err(priv,
		       "Minimum %d tx channels required to run XDP\n",
		       (xdp_ring_num + MLX4_EN_NUM_UP) / MLX4_EN_NUM_UP);
		return -EINVAL;
	}

	if (prog) {
		prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
		if (IS_ERR(prog))
			return PTR_ERR(prog);
	}

	mutex_lock(&mdev->state_lock);
	if (priv->port_up) {
		port_up = 1;
		mlx4_en_stop_port(dev, 1);
	}

	priv->xdp_ring_num = xdp_ring_num;
	netif_set_real_num_tx_queues(dev, priv->tx_ring_num -
							priv->xdp_ring_num);

	for (i = 0; i < priv->rx_ring_num; i++) {
		old_prog = xchg(&priv->rx_ring[i]->xdp_prog, prog);
		if (old_prog)
			bpf_prog_put(old_prog);
	}

	if (port_up) {
		err = mlx4_en_start_port(dev);
		if (err) {
			en_err(priv, "Failed starting port %d for XDP change\n",
			       priv->port);
			queue_work(mdev->workqueue, &priv->watchdog_task);
		}
	}

	mutex_unlock(&mdev->state_lock);
	return 0;
}

static bool mlx4_xdp_attached(struct net_device *dev)
{
	struct mlx4_en_priv *priv = netdev_priv(dev);

	return !!priv->xdp_ring_num;
}

static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
{
	switch (xdp->command) {
	case XDP_SETUP_PROG:
		return mlx4_xdp_set(dev, xdp->prog);
	case XDP_QUERY_PROG:
		xdp->prog_attached = mlx4_xdp_attached(dev);
		return 0;
	default:
		return -EINVAL;
	}
}

static const struct net_device_ops mlx4_netdev_ops = {
	.ndo_open		= mlx4_en_open,
	.ndo_stop		= mlx4_en_close,
@@ -2548,6 +2671,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
	.ndo_udp_tunnel_del	= mlx4_en_del_vxlan_port,
	.ndo_features_check	= mlx4_en_features_check,
	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
	.ndo_xdp		= mlx4_xdp,
};

static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2584,6 +2708,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
	.ndo_udp_tunnel_del	= mlx4_en_del_vxlan_port,
	.ndo_features_check	= mlx4_en_features_check,
	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
	.ndo_xdp		= mlx4_xdp,
};

struct mlx4_en_bond {
+112 −12
Original line number Diff line number Diff line
@@ -32,6 +32,7 @@
 */

#include <net/busy_poll.h>
#include <linux/bpf.h>
#include <linux/mlx4/cq.h>
#include <linux/slab.h>
#include <linux/mlx4/qp.h>
@@ -57,7 +58,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
	struct page *page;
	dma_addr_t dma;

	for (order = MLX4_EN_ALLOC_PREFER_ORDER; ;) {
	for (order = frag_info->order; ;) {
		gfp_t gfp = _gfp;

		if (order)
@@ -70,7 +71,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
			return -ENOMEM;
	}
	dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE << order,
			   PCI_DMA_FROMDEVICE);
			   frag_info->dma_dir);
	if (dma_mapping_error(priv->ddev, dma)) {
		put_page(page);
		return -ENOMEM;
@@ -124,7 +125,8 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
	while (i--) {
		if (page_alloc[i].page != ring_alloc[i].page) {
			dma_unmap_page(priv->ddev, page_alloc[i].dma,
				page_alloc[i].page_size, PCI_DMA_FROMDEVICE);
				page_alloc[i].page_size,
				priv->frag_info[i].dma_dir);
			page = page_alloc[i].page;
			/* Revert changes done by mlx4_alloc_pages */
			page_ref_sub(page, page_alloc[i].page_size /
@@ -145,7 +147,7 @@ static void mlx4_en_free_frag(struct mlx4_en_priv *priv,

	if (next_frag_end > frags[i].page_size)
		dma_unmap_page(priv->ddev, frags[i].dma, frags[i].page_size,
			       PCI_DMA_FROMDEVICE);
			       frag_info->dma_dir);

	if (frags[i].page)
		put_page(frags[i].page);
@@ -176,7 +178,8 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,

		page_alloc = &ring->page_alloc[i];
		dma_unmap_page(priv->ddev, page_alloc->dma,
			       page_alloc->page_size, PCI_DMA_FROMDEVICE);
			       page_alloc->page_size,
			       priv->frag_info[i].dma_dir);
		page = page_alloc->page;
		/* Revert changes done by mlx4_alloc_pages */
		page_ref_sub(page, page_alloc->page_size /
@@ -201,7 +204,7 @@ static void mlx4_en_destroy_allocator(struct mlx4_en_priv *priv,
		       i, page_count(page_alloc->page));

		dma_unmap_page(priv->ddev, page_alloc->dma,
				page_alloc->page_size, PCI_DMA_FROMDEVICE);
				page_alloc->page_size, frag_info->dma_dir);
		while (page_alloc->page_offset + frag_info->frag_stride <
		       page_alloc->page_size) {
			put_page(page_alloc->page);
@@ -244,6 +247,12 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
	struct mlx4_en_rx_alloc *frags = ring->rx_info +
					(index << priv->log_rx_info);

	if (ring->page_cache.index > 0) {
		frags[0] = ring->page_cache.buf[--ring->page_cache.index];
		rx_desc->data[0].addr = cpu_to_be64(frags[0].dma);
		return 0;
	}

	return mlx4_en_alloc_frags(priv, rx_desc, frags, ring->page_alloc, gfp);
}

@@ -502,6 +511,24 @@ void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
	}
}

/* When the rx ring is running in page-per-packet mode, a released frame can go
 * directly into a small cache, to avoid unmapping or touching the page
 * allocator. In bpf prog performance scenarios, buffers are either forwarded
 * or dropped, never converted to skbs, so every page can come directly from
 * this cache when it is sized to be a multiple of the napi budget.
 */
bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
			struct mlx4_en_rx_alloc *frame)
{
	struct mlx4_en_page_cache *cache = &ring->page_cache;

	if (cache->index >= MLX4_EN_CACHE_SIZE)
		return false;

	cache->buf[cache->index++] = *frame;
	return true;
}

void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
			     struct mlx4_en_rx_ring **pring,
			     u32 size, u16 stride)
@@ -509,6 +536,8 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
	struct mlx4_en_dev *mdev = priv->mdev;
	struct mlx4_en_rx_ring *ring = *pring;

	if (ring->xdp_prog)
		bpf_prog_put(ring->xdp_prog);
	mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
	vfree(ring->rx_info);
	ring->rx_info = NULL;
@@ -522,6 +551,16 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
				struct mlx4_en_rx_ring *ring)
{
	int i;

	for (i = 0; i < ring->page_cache.index; i++) {
		struct mlx4_en_rx_alloc *frame = &ring->page_cache.buf[i];

		dma_unmap_page(priv->ddev, frame->dma, frame->page_size,
			       priv->frag_info[0].dma_dir);
		put_page(frame->page);
	}
	ring->page_cache.index = 0;
	mlx4_en_free_rx_buf(priv, ring);
	if (ring->stride <= TXBB_SIZE)
		ring->buf -= TXBB_SIZE;
@@ -743,7 +782,10 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
	struct mlx4_en_rx_alloc *frags;
	struct mlx4_en_rx_desc *rx_desc;
	struct bpf_prog *xdp_prog;
	int doorbell_pending;
	struct sk_buff *skb;
	int tx_index;
	int index;
	int nr;
	unsigned int length;
@@ -759,6 +801,10 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
	if (budget <= 0)
		return polled;

	xdp_prog = READ_ONCE(ring->xdp_prog);
	doorbell_pending = 0;
	tx_index = (priv->tx_ring_num - priv->xdp_ring_num) + cq->ring;

	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
	 * descriptor offset can be deduced from the CQE index instead of
	 * reading 'cqe->index' */
@@ -835,6 +881,43 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));

		/* A bpf program gets first chance to drop the packet. It may
		 * read bytes but not past the end of the frag.
		 */
		if (xdp_prog) {
			struct xdp_buff xdp;
			dma_addr_t dma;
			u32 act;

			dma = be64_to_cpu(rx_desc->data[0].addr);
			dma_sync_single_for_cpu(priv->ddev, dma,
						priv->frag_info[0].frag_size,
						DMA_FROM_DEVICE);

			xdp.data = page_address(frags[0].page) +
							frags[0].page_offset;
			xdp.data_end = xdp.data + length;

			act = bpf_prog_run_xdp(xdp_prog, &xdp);
			switch (act) {
			case XDP_PASS:
				break;
			case XDP_TX:
				if (!mlx4_en_xmit_frame(frags, dev,
							length, tx_index,
							&doorbell_pending))
					goto consumed;
				break;
			default:
				bpf_warn_invalid_xdp_action(act);
			case XDP_ABORTED:
			case XDP_DROP:
				if (mlx4_en_rx_recycle(ring, frags))
					goto consumed;
				goto next;
			}
		}

		if (likely(dev->features & NETIF_F_RXCSUM)) {
			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
						      MLX4_CQE_STATUS_UDP)) {
@@ -986,6 +1069,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
		for (nr = 0; nr < priv->num_frags; nr++)
			mlx4_en_free_frag(priv, frags, nr);

consumed:
		++cq->mcq.cons_index;
		index = (cq->mcq.cons_index) & ring->size_mask;
		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
@@ -994,6 +1078,9 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
	}

out:
	if (doorbell_pending)
		mlx4_en_xmit_doorbell(priv->tx_ring[tx_index]);

	AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled);
	mlx4_cq_set_ci(&cq->mcq);
	wmb(); /* ensure HW sees CQ consumer before we post new buffers */
@@ -1061,22 +1148,35 @@ static const int frag_sizes[] = {

void mlx4_en_calc_rx_buf(struct net_device *dev)
{
	enum dma_data_direction dma_dir = PCI_DMA_FROMDEVICE;
	struct mlx4_en_priv *priv = netdev_priv(dev);
	/* VLAN_HLEN is added twice,to support skb vlan tagged with multiple
	 * headers. (For example: ETH_P_8021Q and ETH_P_8021AD).
	 */
	int eff_mtu = dev->mtu + ETH_HLEN + (2 * VLAN_HLEN);
	int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
	int order = MLX4_EN_ALLOC_PREFER_ORDER;
	u32 align = SMP_CACHE_BYTES;
	int buf_size = 0;
	int i = 0;

	/* bpf requires buffers to be set up as 1 packet per page.
	 * This only works when num_frags == 1.
	 */
	if (priv->xdp_ring_num) {
		dma_dir = PCI_DMA_BIDIRECTIONAL;
		/* This will gain efficient xdp frame recycling at the expense
		 * of more costly truesize accounting
		 */
		align = PAGE_SIZE;
		order = 0;
	}

	while (buf_size < eff_mtu) {
		priv->frag_info[i].order = order;
		priv->frag_info[i].frag_size =
			(eff_mtu > buf_size + frag_sizes[i]) ?
				frag_sizes[i] : eff_mtu - buf_size;
		priv->frag_info[i].frag_prefix_size = buf_size;
		priv->frag_info[i].frag_stride =
				ALIGN(priv->frag_info[i].frag_size,
				      SMP_CACHE_BYTES);
				ALIGN(priv->frag_info[i].frag_size, align);
		priv->frag_info[i].dma_dir = dma_dir;
		buf_size += priv->frag_info[i].frag_size;
		i++;
	}
+210 −64
Original line number Diff line number Diff line
@@ -196,6 +196,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
	ring->last_nr_txbb = 1;
	memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
	memset(ring->buf, 0, ring->buf_size);
	ring->free_tx_desc = mlx4_en_free_tx_desc;

	ring->qp_state = MLX4_QP_STATE_RST;
	ring->doorbell_qpn = cpu_to_be32(ring->qp.qpn << 8);
@@ -265,7 +266,7 @@ static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
}


static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
			 struct mlx4_en_tx_ring *ring,
			 int index, u8 owner, u64 timestamp,
			 int napi_mode)
@@ -344,6 +345,27 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
	return tx_info->nr_txbb;
}

u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
			    struct mlx4_en_tx_ring *ring,
			    int index, u8 owner, u64 timestamp,
			    int napi_mode)
{
	struct mlx4_en_tx_info *tx_info = &ring->tx_info[index];
	struct mlx4_en_rx_alloc frame = {
		.page = tx_info->page,
		.dma = tx_info->map0_dma,
		.page_offset = 0,
		.page_size = PAGE_SIZE,
	};

	if (!mlx4_en_rx_recycle(ring->recycle_ring, &frame)) {
		dma_unmap_page(priv->ddev, tx_info->map0_dma,
			       PAGE_SIZE, priv->frag_info[0].dma_dir);
		put_page(tx_info->page);
	}

	return tx_info->nr_txbb;
}

int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
{
@@ -362,7 +384,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
	}

	while (ring->cons != ring->prod) {
		ring->last_nr_txbb = mlx4_en_free_tx_desc(priv, ring,
		ring->last_nr_txbb = ring->free_tx_desc(priv, ring,
						ring->cons & ring->size_mask,
						!!(ring->cons & ring->size), 0,
						0 /* Non-NAPI caller */);
@@ -444,7 +466,7 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
				timestamp = mlx4_en_get_cqe_ts(cqe);

			/* free next descriptor */
			last_nr_txbb = mlx4_en_free_tx_desc(
			last_nr_txbb = ring->free_tx_desc(
					priv, ring, ring_index,
					!!((ring_cons + txbbs_skipped) &
					ring->size), timestamp, napi_budget);
@@ -476,6 +498,9 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
	ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
	ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;

	if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
		return done < budget;

	netdev_tx_completed_queue(ring->tx_queue, packets, bytes);

	/* Wakeup Tx queue if this stopped, and ring is not full.
@@ -631,8 +656,7 @@ static int get_real_size(const struct sk_buff *skb,
static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
			     const struct sk_buff *skb,
			     const struct skb_shared_info *shinfo,
			     int real_size, u16 *vlan_tag,
			     int tx_ind, void *fragptr)
			     void *fragptr)
{
	struct mlx4_wqe_inline_seg *inl = &tx_desc->inl;
	int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof *inl;
@@ -700,10 +724,66 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
	__iowrite64_copy(dst, src, bytecnt / 8);
}

void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
{
	wmb();
	/* Since there is no iowrite*_native() that writes the
	 * value as is, without byteswapping - using the one
	 * the doesn't do byteswapping in the relevant arch
	 * endianness.
	 */
#if defined(__LITTLE_ENDIAN)
	iowrite32(
#else
	iowrite32be(
#endif
		  ring->doorbell_qpn,
		  ring->bf.uar->map + MLX4_SEND_DOORBELL);
}

static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
				  struct mlx4_en_tx_desc *tx_desc,
				  union mlx4_wqe_qpn_vlan qpn_vlan,
				  int desc_size, int bf_index,
				  __be32 op_own, bool bf_ok,
				  bool send_doorbell)
{
	tx_desc->ctrl.qpn_vlan = qpn_vlan;

	if (bf_ok) {
		op_own |= htonl((bf_index & 0xffff) << 8);
		/* Ensure new descriptor hits memory
		 * before setting ownership of this descriptor to HW
		 */
		dma_wmb();
		tx_desc->ctrl.owner_opcode = op_own;

		wmb();

		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
			     desc_size);

		wmb();

		ring->bf.offset ^= ring->bf.buf_size;
	} else {
		/* Ensure new descriptor hits memory
		 * before setting ownership of this descriptor to HW
		 */
		dma_wmb();
		tx_desc->ctrl.owner_opcode = op_own;
		if (send_doorbell)
			mlx4_en_xmit_doorbell(ring);
		else
			ring->xmit_more++;
	}
}

netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct skb_shared_info *shinfo = skb_shinfo(skb);
	struct mlx4_en_priv *priv = netdev_priv(dev);
	union mlx4_wqe_qpn_vlan	qpn_vlan = {};
	struct device *ddev = priv->ddev;
	struct mlx4_en_tx_ring *ring;
	struct mlx4_en_tx_desc *tx_desc;
@@ -715,7 +795,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
	int real_size;
	u32 index, bf_index;
	__be32 op_own;
	u16 vlan_tag = 0;
	u16 vlan_proto = 0;
	int i_frag;
	int lso_header_size;
@@ -725,6 +804,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
	bool stop_queue;
	bool inline_ok;
	u32 ring_cons;
	bool bf_ok;

	tx_ind = skb_get_queue_mapping(skb);
	ring = priv->tx_ring[tx_ind];
@@ -749,9 +829,17 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
		goto tx_drop;
	}

	bf_ok = ring->bf_enabled;
	if (skb_vlan_tag_present(skb)) {
		vlan_tag = skb_vlan_tag_get(skb);
		qpn_vlan.vlan_tag = cpu_to_be16(skb_vlan_tag_get(skb));
		vlan_proto = be16_to_cpu(skb->vlan_proto);
		if (vlan_proto == ETH_P_8021AD)
			qpn_vlan.ins_vlan = MLX4_WQE_CTRL_INS_SVLAN;
		else if (vlan_proto == ETH_P_8021Q)
			qpn_vlan.ins_vlan = MLX4_WQE_CTRL_INS_CVLAN;
		else
			qpn_vlan.ins_vlan = 0;
		bf_ok = false;
	}

	netdev_txq_bql_enqueue_prefetchw(ring->tx_queue);
@@ -771,6 +859,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
	else {
		tx_desc = (struct mlx4_en_tx_desc *) ring->bounce_buf;
		bounce = true;
		bf_ok = false;
	}

	/* Save skb in tx_info ring */
@@ -907,8 +996,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len);

	if (tx_info->inl)
		build_inline_wqe(tx_desc, skb, shinfo, real_size, &vlan_tag,
				 tx_ind, fragptr);
		build_inline_wqe(tx_desc, skb, shinfo, fragptr);

	if (skb->encapsulation) {
		union {
@@ -946,60 +1034,15 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)

	real_size = (real_size / 16) & 0x3f;

	if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
	    !skb_vlan_tag_present(skb) && send_doorbell) {
		tx_desc->ctrl.bf_qpn = ring->doorbell_qpn |
				       cpu_to_be32(real_size);

		op_own |= htonl((bf_index & 0xffff) << 8);
		/* Ensure new descriptor hits memory
		 * before setting ownership of this descriptor to HW
		 */
		dma_wmb();
		tx_desc->ctrl.owner_opcode = op_own;

		wmb();
	bf_ok &= desc_size <= MAX_BF && send_doorbell;

		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
			     desc_size);

		wmb();

		ring->bf.offset ^= ring->bf.buf_size;
	} else {
		tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
		if (vlan_proto == ETH_P_8021AD)
			tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_SVLAN;
		else if (vlan_proto == ETH_P_8021Q)
			tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_CVLAN;
	if (bf_ok)
		qpn_vlan.bf_qpn = ring->doorbell_qpn | cpu_to_be32(real_size);
	else
			tx_desc->ctrl.ins_vlan = 0;
		qpn_vlan.fence_size = real_size;

		tx_desc->ctrl.fence_size = real_size;

		/* Ensure new descriptor hits memory
		 * before setting ownership of this descriptor to HW
		 */
		dma_wmb();
		tx_desc->ctrl.owner_opcode = op_own;
		if (send_doorbell) {
			wmb();
			/* Since there is no iowrite*_native() that writes the
			 * value as is, without byteswapping - using the one
			 * the doesn't do byteswapping in the relevant arch
			 * endianness.
			 */
#if defined(__LITTLE_ENDIAN)
			iowrite32(
#else
			iowrite32be(
#endif
				  ring->doorbell_qpn,
				  ring->bf.uar->map + MLX4_SEND_DOORBELL);
		} else {
			ring->xmit_more++;
		}
	}
	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
			      op_own, bf_ok, send_doorbell);

	if (unlikely(stop_queue)) {
		/* If queue was emptied after the if (stop_queue) , and before
@@ -1034,3 +1077,106 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
	return NETDEV_TX_OK;
}

netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_alloc *frame,
			       struct net_device *dev, unsigned int length,
			       int tx_ind, int *doorbell_pending)
{
	struct mlx4_en_priv *priv = netdev_priv(dev);
	union mlx4_wqe_qpn_vlan	qpn_vlan = {};
	struct mlx4_en_tx_ring *ring;
	struct mlx4_en_tx_desc *tx_desc;
	struct mlx4_wqe_data_seg *data;
	struct mlx4_en_tx_info *tx_info;
	int index, bf_index;
	bool send_doorbell;
	int nr_txbb = 1;
	bool stop_queue;
	dma_addr_t dma;
	int real_size;
	__be32 op_own;
	u32 ring_cons;
	bool bf_ok;

	BUILD_BUG_ON_MSG(ALIGN(CTRL_SIZE + DS_SIZE, TXBB_SIZE) != TXBB_SIZE,
			 "mlx4_en_xmit_frame requires minimum size tx desc");

	ring = priv->tx_ring[tx_ind];

	if (!priv->port_up)
		goto tx_drop;

	if (mlx4_en_is_tx_ring_full(ring))
		goto tx_drop;

	/* fetch ring->cons far ahead before needing it to avoid stall */
	ring_cons = READ_ONCE(ring->cons);

	index = ring->prod & ring->size_mask;
	tx_info = &ring->tx_info[index];

	bf_ok = ring->bf_enabled;

	/* Track current inflight packets for performance analysis */
	AVG_PERF_COUNTER(priv->pstats.inflight_avg,
			 (u32)(ring->prod - ring_cons - 1));

	bf_index = ring->prod;
	tx_desc = ring->buf + index * TXBB_SIZE;
	data = &tx_desc->data;

	dma = frame->dma;

	tx_info->page = frame->page;
	frame->page = NULL;
	tx_info->map0_dma = dma;
	tx_info->map0_byte_count = length;
	tx_info->nr_txbb = nr_txbb;
	tx_info->nr_bytes = max_t(unsigned int, length, ETH_ZLEN);
	tx_info->data_offset = (void *)data - (void *)tx_desc;
	tx_info->ts_requested = 0;
	tx_info->nr_maps = 1;
	tx_info->linear = 1;
	tx_info->inl = 0;

	dma_sync_single_for_device(priv->ddev, dma, length, PCI_DMA_TODEVICE);

	data->addr = cpu_to_be64(dma);
	data->lkey = ring->mr_key;
	dma_wmb();
	data->byte_count = cpu_to_be32(length);

	/* tx completion can avoid cache line miss for common cases */
	tx_desc->ctrl.srcrb_flags = priv->ctrl_flags;

	op_own = cpu_to_be32(MLX4_OPCODE_SEND) |
		((ring->prod & ring->size) ?
		 cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);

	ring->packets++;
	ring->bytes += tx_info->nr_bytes;
	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);

	ring->prod += nr_txbb;

	stop_queue = mlx4_en_is_tx_ring_full(ring);
	send_doorbell = stop_queue ||
				*doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
	bf_ok &= send_doorbell;

	real_size = ((CTRL_SIZE + nr_txbb * DS_SIZE) / 16) & 0x3f;

	if (bf_ok)
		qpn_vlan.bf_qpn = ring->doorbell_qpn | cpu_to_be32(real_size);
	else
		qpn_vlan.fence_size = real_size;

	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
			      op_own, bf_ok, send_doorbell);
	*doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;

	return NETDEV_TX_OK;

tx_drop:
	ring->tx_dropped++;
	return NETDEV_TX_BUSY;
}
Loading