Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit bd265242 authored by David S. Miller's avatar David S. Miller
Browse files

Merge branch 'offload_fwd_mark'

Scott Feldman says:

====================
switchdev: avoid duplicate packet forwarding

v3:

 - Per Nicolas Dichtel review: remove errant empty union.

v2:

 - Per davem review: in sk_buff, union fwd_mark with secmark to save space
   since features appear to be mutually exclusive.
 - Per Simon Horman review:
   - fix grammar in switchdev.txt wrt fwd_mark
   - remove some unrelated changes that snuck in

v1:

This patchset was previously submitted as RFC.  No changes from the last
version (v2) sent under RFC.  Including RFC version history here for reference.

RFC v2:

 - s/fwd_mark/offload_fwd_mark
 - use consume_skb rather than kfree_skb when dropping pkt on egress.
 - Use Jiri's suggestion to use ifindex of one of the ports in a group
   as the mark for all the ports in the group.  This can be done with
   no additional storage (no hashtable from v1).  To pull it off, we
   need some simple recursive routines to walk the netdev tree ensuring
   all leaves in the tree (ports) in the same group (e.g. bridge)
   belonging to the same switch device will have the same offload fwd mark.
   Maybe someone sees a better design for the recusive routines?  They're
   not too bad, and should cover the stacked driver cases.

RFC v1:

With switchdev support for offloading L2/L3 forwarding data path to a
switch device, we have a general problem where both the device and the
kernel may forward the packet, resulting in duplicate packets on the wire.
Anytime a packet is forwarded by the device and a copy is sent to the CPU,
there is potential for duplicate forwarding, as the kernel may also do a
forwarding lookup and send the packet on the wire.

The specific problem this patch series is interested in solving is avoiding
duplicate packets on bridged ports.  There was a previous RFC from Roopa
(http://marc.info/?l=linux-netdev&m=142687073314252&w=2

) to address this
problem, but didn't solve the problem of mixed ports in the bridge from
different devices; there was no way to exclude some ports from forwarding
and include others.  This RFC solves that problem by tagging the ingressing
packet with a unique mark, and then comparing the packet mark with the
egress port mark, and skip forwarding when there is a match.  For the mixed
ports bridge case, only those ports with matching marks are skipped.

The switchdev port driver must do two things:

1) Generate a fwd_mark for each switch port, using some unique key of the
   switch device (and optionally port).  This is done when the port netdev
   is registered or if the port's group membership changes (joins/leaves
   a bridge, for example).

2) On packet ingress from port, mark the skb with the ingress port's
   fwd_mark.  If the device supports it, it's useful to only mark skbs
   which were already forwarded by the device.  If the device does not
   support such indication, all skbs can be marked, even if they're
   local dst.

Two new 32-bit fields are added to struct sk_buff and struct netdevice to
hold the fwd_mark.  I've wrapped these with CONFIG_NET_SWITCHDEV for now. I
tried using skb->mark for this purpose, but ebtables can overwrite the
skb->mark before the bridge gets it, so that will not work.

In general, this fwd_mark can be used for any case where a packet is
forwarded by the device and a copy is sent to the CPU, to avoid the kernel
re-forwarding the packet.  sFlow is another use-case that comes to mind,
but I haven't explored the details.
====================

Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
parents 8254973f a48037e7
Loading
Loading
Loading
Loading
+12 −2
Original line number Diff line number Diff line
@@ -279,8 +279,18 @@ and unknown unicast packets to all ports in domain, if allowed by port's
current STP state.  The switch driver, knowing which ports are within which
vlan L2 domain, can program the switch device for flooding.  The packet should
also be sent to the port netdev for processing by the bridge driver.  The
bridge should not reflood the packet to the same ports the device flooded.
XXX: the mechanism to avoid duplicate flood packets is being discuseed.
bridge should not reflood the packet to the same ports the device flooded,
otherwise there will be duplicate packets on the wire.

To avoid duplicate packets, the device/driver should mark a packet as already
forwarded using skb->offload_fwd_mark.  The same mark is set on the device
ports in the domain using dev->offload_fwd_mark.  If the skb->offload_fwd_mark
is non-zero and matches the forwarding egress port's dev->skb_mark, the kernel
will drop the skb right before transmit on the egress port, with the
understanding that the device already forwarded the packet on same egress port.
The driver can use switchdev_port_fwd_mark_set() to set a globally unique mark
for port's dev->offload_fwd_mark, based on the port's parent ID (switch ID) and
a group ifindex.

It is possible for the switch device to not handle flooding and push the
packets up to the bridge driver for flooding.  This is not ideal as the number
+11 −0
Original line number Diff line number Diff line
@@ -4822,6 +4822,7 @@ static int rocker_port_rx_proc(const struct rocker *rocker,
	const struct rocker_tlv *attrs[ROCKER_TLV_RX_MAX + 1];
	struct sk_buff *skb = rocker_desc_cookie_ptr_get(desc_info);
	size_t rx_len;
	u16 rx_flags = 0;

	if (!skb)
		return -ENOENT;
@@ -4829,6 +4830,8 @@ static int rocker_port_rx_proc(const struct rocker *rocker,
	rocker_tlv_parse_desc(attrs, ROCKER_TLV_RX_MAX, desc_info);
	if (!attrs[ROCKER_TLV_RX_FRAG_LEN])
		return -EINVAL;
	if (attrs[ROCKER_TLV_RX_FLAGS])
		rx_flags = rocker_tlv_get_u16(attrs[ROCKER_TLV_RX_FLAGS]);

	rocker_dma_rx_ring_skb_unmap(rocker, attrs);

@@ -4836,6 +4839,9 @@ static int rocker_port_rx_proc(const struct rocker *rocker,
	skb_put(skb, rx_len);
	skb->protocol = eth_type_trans(skb, rocker_port->dev);

	if (rx_flags & ROCKER_RX_FLAGS_FWD_OFFLOAD)
		skb->offload_fwd_mark = rocker_port->dev->offload_fwd_mark;

	rocker_port->dev->stats.rx_packets++;
	rocker_port->dev->stats.rx_bytes += skb->len;

@@ -4973,6 +4979,8 @@ static int rocker_probe_port(struct rocker *rocker, unsigned int port_number)
	}
	rocker->ports[port_number] = rocker_port;

	switchdev_port_fwd_mark_set(rocker_port->dev, NULL, false);

	rocker_port_set_learning(rocker_port, SWITCHDEV_TRANS_NONE);

	err = rocker_port_ig_tbl(rocker_port, SWITCHDEV_TRANS_NONE, 0);
@@ -5252,6 +5260,7 @@ static int rocker_port_bridge_join(struct rocker_port *rocker_port,
		rocker_port_internal_vlan_id_get(rocker_port, bridge->ifindex);

	rocker_port->bridge_dev = bridge;
	switchdev_port_fwd_mark_set(rocker_port->dev, bridge, true);

	return rocker_port_vlan_add(rocker_port, SWITCHDEV_TRANS_NONE,
				    untagged_vid, 0);
@@ -5272,6 +5281,8 @@ static int rocker_port_bridge_leave(struct rocker_port *rocker_port)
		rocker_port_internal_vlan_id_get(rocker_port,
						 rocker_port->dev->ifindex);

	switchdev_port_fwd_mark_set(rocker_port->dev, rocker_port->bridge_dev,
				    false);
	rocker_port->bridge_dev = NULL;

	err = rocker_port_vlan_add(rocker_port, SWITCHDEV_TRANS_NONE,
+1 −0
Original line number Diff line number Diff line
@@ -246,6 +246,7 @@ enum {
#define ROCKER_RX_FLAGS_TCP			BIT(5)
#define ROCKER_RX_FLAGS_UDP			BIT(6)
#define ROCKER_RX_FLAGS_TCP_UDP_CSUM_GOOD	BIT(7)
#define ROCKER_RX_FLAGS_FWD_OFFLOAD		BIT(8)

enum {
	ROCKER_TLV_TX_UNSPEC,
+13 −0
Original line number Diff line number Diff line
@@ -766,6 +766,13 @@ struct netdev_phys_item_id {
	unsigned char id_len;
};

static inline bool netdev_phys_item_id_same(struct netdev_phys_item_id *a,
					    struct netdev_phys_item_id *b)
{
	return a->id_len == b->id_len &&
	       memcmp(a->id, b->id, a->id_len) == 0;
}

typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
				       struct sk_buff *skb);

@@ -1456,6 +1463,8 @@ enum netdev_priv_flags {
 *
 *	@xps_maps:	XXX: need comments on this one
 *
 *	@offload_fwd_mark:	Offload device fwding mark
 *
 *	@trans_start:		Time (in jiffies) of last Tx
 *	@watchdog_timeo:	Represents the timeout that is used by
 *				the watchdog ( see dev_watchdog() )
@@ -1697,6 +1706,10 @@ struct net_device {
	struct xps_dev_maps __rcu *xps_maps;
#endif

#ifdef CONFIG_NET_SWITCHDEV
	u32			offload_fwd_mark;
#endif

	/* These may be needed for future network-power-down code. */

	/*
+8 −1
Original line number Diff line number Diff line
@@ -506,6 +506,7 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
 *	@no_fcs:  Request NIC to treat last 4 bytes as Ethernet FCS
  *	@napi_id: id of the NAPI struct this skb came from
 *	@secmark: security marking
 *	@offload_fwd_mark: fwding offload mark
 *	@mark: Generic packet mark
 *	@vlan_proto: vlan encapsulation protocol
 *	@vlan_tci: vlan tag control information
@@ -650,9 +651,15 @@ struct sk_buff {
		unsigned int	sender_cpu;
	};
#endif
	union {
#ifdef CONFIG_NETWORK_SECMARK
		__u32		secmark;
#endif
#ifdef CONFIG_NET_SWITCHDEV
		__u32		offload_fwd_mark;
#endif
	};

	union {
		__u32		mark;
		__u32		reserved_tailroom;
Loading