
Commit c4cde580 authored by David S. Miller


Daniel Borkmann says:

====================
pull-request: bpf-next 2019-07-03

The following pull-request contains BPF updates for your *net-next* tree.

There is a minor merge conflict in mlx5 due to 8960b389 ("linux/dim:
Rename externally used net_dim members") which has been pulled into your
tree in the meantime, but resolution seems not that bad ... getting current
bpf-next out now before there's coming more on mlx5. ;) I'm Cc'ing Saeed
just so he's aware of the resolution below:

** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

  <<<<<<< HEAD
  static int mlx5e_open_cq(struct mlx5e_channel *c,
                           struct dim_cq_moder moder,
                           struct mlx5e_cq_param *param,
                           struct mlx5e_cq *cq)
  =======
  int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder,
                    struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
  >>>>>>> e5a3e259

Resolution is to take the second chunk and rename net_dim_cq_moder into
dim_cq_moder. Also the signature for mlx5e_open_cq() in ...

  drivers/net/ethernet/mellanox/mlx5/core/en.h +977

... and in mlx5e_open_xsk() ...

  drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64

... needs the same rename from net_dim_cq_moder into dim_cq_moder.

** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

  <<<<<<< HEAD
          int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
          struct dim_cq_moder icocq_moder = {0, 0};
          struct net_device *netdev = priv->netdev;
          struct mlx5e_channel *c;
          unsigned int irq;
  =======
          struct net_dim_cq_moder icocq_moder = {0, 0};
  >>>>>>> e5a3e259

Take the second chunk and rename net_dim_cq_moder into dim_cq_moder
as well.

Let me know if you run into any issues. Anyway, the main changes are:

1) Long-awaited AF_XDP support for mlx5e driver, from Maxim.

2) Addition of two new per-cgroup BPF hooks for getsockopt and
   setsockopt along with a new sockopt program type which allows more
   fine-grained pass/reject settings for containers. Also add a sock_ops
   callback that can be selectively enabled on a per-socket basis and is
   executed for every RTT to help tracking TCP statistics, both features
   from Stanislav.

3) Follow-up fix from loops in precision tracking which was not propagating
   precision marks; as a result, the verifier assumed that some branches were
   not taken and wrongly removed them as dead code, from Alexei.

4) Fix BPF cgroup release synchronization race which could lead to a
   double-free if a leaf's cgroup_bpf object is released and a new BPF
   program is attached to one of the ancestor cgroups in parallel, from Roman.

5) Support for bulking XDP_TX on veth devices which improves performance
   in some cases by around 9%, from Toshiaki.

6) Allow for lookups into BPF devmap and improve feedback when calling into
   bpf_redirect_map() as lookup is now performed right away in the helper
   itself, from Toke.

7) Add support for fq's Earliest Departure Time to the Host Bandwidth
   Manager (HBM) sample BPF program, from Lawrence.

8) Various cleanups and minor fixes all over the place from many others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
parents e2c74694 e5a3e259
+1 −0
@@ -42,6 +42,7 @@ Program types
.. toctree::
   :maxdepth: 1

   prog_cgroup_sockopt
   prog_cgroup_sysctl
   prog_flow_dissector

+93 −0
.. SPDX-License-Identifier: GPL-2.0

============================
BPF_PROG_TYPE_CGROUP_SOCKOPT
============================

The ``BPF_PROG_TYPE_CGROUP_SOCKOPT`` program type can be attached to two
cgroup hooks:

* ``BPF_CGROUP_GETSOCKOPT`` - called every time a process executes the
  ``getsockopt`` system call.
* ``BPF_CGROUP_SETSOCKOPT`` - called every time a process executes the
  ``setsockopt`` system call.

The context (``struct bpf_sockopt``) has the associated socket (``sk``) and
all input arguments: ``level``, ``optname``, ``optval`` and ``optlen``.

BPF_CGROUP_SETSOCKOPT
=====================

``BPF_CGROUP_SETSOCKOPT`` is triggered *before* the kernel handles the
socket option and has a writable context: it can modify the supplied
arguments before passing them down to the kernel. This hook has access to
the cgroup and socket local storage.

If a BPF program sets ``optlen`` to -1, control is returned to userspace
after all other BPF programs in the cgroup chain finish (i.e. kernel
``setsockopt`` handling will *not* be executed).

Note that ``optlen`` cannot be increased beyond the user-supplied
value. It can only be decreased or set to -1. Any other value will
trigger ``EFAULT``.

Return Type
-----------

* ``0`` - reject the syscall, ``EPERM`` will be returned to userspace.
* ``1`` - success, continue with next BPF program in the cgroup chain.

BPF_CGROUP_GETSOCKOPT
=====================

``BPF_CGROUP_GETSOCKOPT`` is triggered *after* the kernel handling of
sockopt. The BPF hook can observe ``optval``, ``optlen`` and ``retval``
if it's interested in whatever the kernel has returned. The BPF hook can
override the values above, adjust ``optlen`` and reset ``retval`` to 0. If
``optlen`` has been increased above the initial ``getsockopt`` value (i.e.
the userspace buffer is too small), ``EFAULT`` is returned.

This hook has access to the cgroup and socket local storage.

Note that the only acceptable values to set ``retval`` to are 0 and the
original value that the kernel returned. Any other value will trigger
``EFAULT``.
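
The ``optlen`` and ``retval`` constraints for this hook can be modelled in
plain userspace C as follows. This is a sketch of the documented rules only,
not kernel code; ``check_getsockopt_result`` and ``getsockopt_demo`` are
hypothetical names.

```c
#include <assert.h>
#include <errno.h>

/*
 * Model of the checks applied after the BPF program chain has run.
 *
 * prog_ret:      verdict of the chain (0 = reject, 1 = success)
 * user_optlen:   optlen passed to the original getsockopt call
 * new_optlen:    optlen left in the context by the BPF programs
 * kernel_retval: value returned by the kernel getsockopt handler
 * new_retval:    retval left in the context by the BPF programs
 *
 * Returns the value the syscall would return in this model.
 */
int check_getsockopt_result(int prog_ret, int user_optlen, int new_optlen,
			    int kernel_retval, int new_retval)
{
	if (prog_ret == 0)
		return -EPERM;

	/* The userspace buffer cannot hold more than it was sized for. */
	if (new_optlen > user_optlen)
		return -EFAULT;

	/* retval may be reset to 0 or left as the kernel returned it. */
	if (new_retval != 0 && new_retval != kernel_retval)
		return -EFAULT;

	return new_retval;
}

void getsockopt_demo(void)
{
	/* The kernel failed, the program supplied its own value and reset retval. */
	assert(check_getsockopt_result(1, 16, 4, -ENOPROTOOPT, 0) == 0);

	/* retval left untouched: the kernel's error is passed through. */
	assert(check_getsockopt_result(1, 16, 16, -ENOPROTOOPT, -ENOPROTOOPT)
	       == -ENOPROTOOPT);

	/* An arbitrary retval is not accepted. */
	assert(check_getsockopt_result(1, 16, 16, 0, -EINVAL) == -EFAULT);
}
```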

Return Type
-----------

* ``0`` - reject the syscall, ``EPERM`` will be returned to userspace.
* ``1`` - success: copy ``optval`` and ``optlen`` to userspace, return
  ``retval`` from the syscall (note that this can be overwritten by
  the BPF program from the parent cgroup).

Cgroup Inheritance
==================

Suppose there is the following cgroup hierarchy where each cgroup
has a ``BPF_CGROUP_GETSOCKOPT`` program attached with the
``BPF_F_ALLOW_MULTI`` flag::

  A (root, parent)
   \
    B (child)

When the application calls the ``getsockopt`` syscall from cgroup B,
the programs are executed from the bottom up: B, A. The first program
(B) sees the result of the kernel's ``getsockopt``. It can optionally
adjust ``optval``, ``optlen`` and reset ``retval`` to 0. After that,
control is passed to the second program (A), which sees the same
context as B, including any potential modifications.

The same applies to ``BPF_CGROUP_SETSOCKOPT``: if programs are attached to
A and B, the trigger order is B, then A. If B makes any changes
to the input arguments (``level``, ``optname``, ``optval``, ``optlen``),
then the next program in the chain (A) will see those changes,
*not* the original input ``setsockopt`` arguments. The potentially
modified values are then passed down to the kernel.
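
The bottom-up ordering and the shared context can be illustrated with a
userspace C model in which each "program" is an ordinary function. All names
here (``struct sockopt_ctx``, ``prog_a``, ``prog_b``, ``run_chain``,
``chain_demo``) are hypothetical, and the context is a simplified stand-in for
``struct bpf_sockopt``. Combining the verdicts with a logical AND while still
running every program is an assumption of this sketch, not something stated
above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct bpf_sockopt; the field set is an assumption. */
struct sockopt_ctx {
	int level;
	int optname;
	int optlen;
	int retval;
};

typedef int (*sockopt_prog_fn)(struct sockopt_ctx *ctx);

/* Records what the program attached to cgroup A observed. */
int optlen_seen_by_a;

/* Program attached to the child cgroup B: shrinks optlen to at most 4. */
int prog_b(struct sockopt_ctx *ctx)
{
	if (ctx->optlen > 4)
		ctx->optlen = 4;
	return 1;
}

/* Program attached to the parent cgroup A: only observes the context. */
int prog_a(struct sockopt_ctx *ctx)
{
	optlen_seen_by_a = ctx->optlen;
	return 1;
}

/*
 * Run the programs in the given order against one shared context.
 * Assumption of this sketch: every program runs and the verdicts are ANDed.
 */
int run_chain(const sockopt_prog_fn *progs, size_t nr_progs,
	      struct sockopt_ctx *ctx)
{
	int verdict = 1;
	size_t i;

	for (i = 0; i < nr_progs; i++)
		verdict &= progs[i](ctx);

	return verdict;
}

void chain_demo(void)
{
	/* Bottom-up order: child B first, then parent A. */
	const sockopt_prog_fn chain[] = { prog_b, prog_a };
	struct sockopt_ctx ctx = { .level = 1, .optname = 2, .optlen = 8, .retval = 0 };

	assert(run_chain(chain, 2, &ctx) == 1);
	/* A saw the value written by B, not the original 8. */
	assert(optlen_seen_by_a == 4);
	assert(ctx.optlen == 4);
}
```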

Example
=======

See ``tools/testing/selftests/bpf/progs/sockopt_sk.c`` for an example
of a BPF program that handles socket options.
+15 −1
@@ -220,7 +220,21 @@ Usage
In order to use AF_XDP sockets there are two parts needed. The
user-space application and the XDP program. For a complete setup and
usage example, please refer to the sample application. The user-space
side is xdpsock_user.c and the XDP side xdpsock_kern.c.
side is xdpsock_user.c and the XDP side is part of libbpf.

The XDP code sample included in tools/lib/bpf/xsk.c is the following::

   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
   {
       int index = ctx->rx_queue_index;

       // A set entry here means that the corresponding queue_id
       // has an active AF_XDP socket bound to it.
       if (bpf_map_lookup_elem(&xsks_map, &index))
           return bpf_redirect_map(&xsks_map, index, 0);

       return XDP_PASS;
   }

Naive ring dequeue and enqueue could look like this::

+7 −5
@@ -641,8 +641,8 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
	struct i40e_tx_desc *tx_desc = NULL;
	struct i40e_tx_buffer *tx_bi;
	bool work_done = true;
	struct xdp_desc desc;
	dma_addr_t dma;
	u32 len;

	while (budget-- > 0) {
		if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
@@ -651,21 +651,23 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
			break;
		}

		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
			break;

		dma_sync_single_for_device(xdp_ring->dev, dma, len,
		dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);

		dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
					   DMA_BIDIRECTIONAL);

		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
		tx_bi->bytecount = len;
		tx_bi->bytecount = desc.len;

		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
		tx_desc->buffer_addr = cpu_to_le64(dma);
		tx_desc->cmd_type_offset_bsz =
			build_ctob(I40E_TX_DESC_CMD_ICRC
				   | I40E_TX_DESC_CMD_EOP,
				   0, len, 0);
				   0, desc.len, 0);

		xdp_ring->next_to_use++;
		if (xdp_ring->next_to_use == xdp_ring->count)
+9 −6
@@ -571,8 +571,9 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
	union ixgbe_adv_tx_desc *tx_desc = NULL;
	struct ixgbe_tx_buffer *tx_bi;
	bool work_done = true;
	u32 len, cmd_type;
	struct xdp_desc desc;
	dma_addr_t dma;
	u32 cmd_type;

	while (budget-- > 0) {
		if (unlikely(!ixgbe_desc_unused(xdp_ring)) ||
@@ -581,14 +582,16 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
			break;
		}

		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
			break;

		dma_sync_single_for_device(xdp_ring->dev, dma, len,
		dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);

		dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
					   DMA_BIDIRECTIONAL);

		tx_bi = &xdp_ring->tx_buffer_info[xdp_ring->next_to_use];
		tx_bi->bytecount = len;
		tx_bi->bytecount = desc.len;
		tx_bi->xdpf = NULL;
		tx_bi->gso_segs = 1;

@@ -599,10 +602,10 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
		cmd_type = IXGBE_ADVTXD_DTYP_DATA |
			   IXGBE_ADVTXD_DCMD_DEXT |
			   IXGBE_ADVTXD_DCMD_IFCS;
		cmd_type |= len | IXGBE_TXD_CMD;
		cmd_type |= desc.len | IXGBE_TXD_CMD;
		tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);
		tx_desc->read.olinfo_status =
			cpu_to_le32(len << IXGBE_ADVTXD_PAYLEN_SHIFT);
			cpu_to_le32(desc.len << IXGBE_ADVTXD_PAYLEN_SHIFT);

		xdp_ring->next_to_use++;
		if (xdp_ring->next_to_use == xdp_ring->count)