
Commit 46d3ceab authored by Eric Dumazet, committed by David S. Miller

tcp: TCP Small Queues



This introduces TSQ (TCP Small Queues).

The goal of TSQ is to reduce the number of TCP packets in xmit queues
(qdisc & device queues), in order to reduce RTT and cwnd bias, which are
part of the bufferbloat problem.

sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.
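
The limit check described above can be modeled in userspace. This is a minimal sketch, not the kernel code: the struct and the names model_sock, model_try_xmit and model_wfree are hypothetical, and wmem_alloc only stands in for sk->sk_wmem_alloc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of one TCP socket under TSQ. */
struct model_sock {
	size_t wmem_alloc;	/* bytes sitting in qdisc/device queues */
	bool throttled;		/* set when the limit stopped a transmit */
};

/* Try to queue one packet; refuse once the limit is already reached. */
static bool model_try_xmit(struct model_sock *sk, size_t truesize, size_t limit)
{
	if (sk->wmem_alloc >= limit) {
		sk->throttled = true;
		return false;
	}
	sk->wmem_alloc += truesize;
	return true;
}

/* A packet was freed by the lower layers; report whether the sender
 * was throttled and should now be restarted.
 */
static bool model_wfree(struct model_sock *sk, size_t truesize)
{
	assert(sk->wmem_alloc >= truesize);
	sk->wmem_alloc -= truesize;
	if (sk->throttled) {
		sk->throttled = false;
		return true;
	}
	return false;
}
```

With a 131072 byte limit and 65536 byte packets, the model accepts two packets, refuses the third, and accepts again once one packet has been freed.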

TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.

As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2: smaller TSO packets can help
reduce the latency of high prio packets.
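
The sizing rule can be written out as arithmetic. This is a sketch under the assumption that the cap is simply the smaller of the gso max limit and half the TSQ limit; model_tso_cap and MODEL_GSO_MAX_SIZE are hypothetical names, not kernel symbols.

```c
#include <assert.h>

/* Standard gso max limit quoted in the commit message. */
#define MODEL_GSO_MAX_SIZE 65536u

/* Largest TSO packet allowed for a given TSQ limit: half the limit,
 * but never more than the gso max limit.
 */
static unsigned int model_tso_cap(unsigned int limit_bytes)
{
	unsigned int half = limit_bytes / 2;

	return half < MODEL_GSO_MAX_SIZE ? half : MODEL_GSO_MAX_SIZE;
}
```

Under this model a limit of 40000 gives packets of at most 20000 bytes, while the default of 131072 leaves the 65536 byte maximum unchanged.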

This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.

Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.

Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)

I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and socket autotuning on both sides no longer uses 4 MBytes.

As the skb destructor cannot restart xmit itself (as the qdisc lock might
be taken at this point), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.

If the tasklet finds a socket owned by the user, it sets the TSQ_OWNED
flag. This flag is tested in a new protocol method called from
release_sock(), so that new segments can be sent at that point.
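
The deferral between the tasklet and release_sock() can be illustrated with a small state model. This is a userspace sketch only: struct tsq_model and the model_* functions are hypothetical, plain bit operations stand in for the kernel's atomic bitops, and xmit_calls merely counts how often transmission would be restarted.

```c
#include <assert.h>
#include <stdbool.h>

/* Same order as enum tsq_flags in the patch, prefixed to mark it a model. */
enum model_tsq_flags {
	MODEL_TSQ_THROTTLED,
	MODEL_TSQ_QUEUED,
	MODEL_TSQ_OWNED,
};

struct tsq_model {
	unsigned long tsq_flags;
	bool owned_by_user;	/* socket currently locked by a process */
	int xmit_calls;		/* how many times xmit was restarted */
};

static void model_push(struct tsq_model *sk)
{
	sk->xmit_calls++;
}

/* Tasklet side: send now, or leave a note for the lock owner. */
static void model_tasklet(struct tsq_model *sk)
{
	sk->tsq_flags &= ~(1UL << MODEL_TSQ_QUEUED);
	if (sk->owned_by_user)
		sk->tsq_flags |= 1UL << MODEL_TSQ_OWNED;
	else
		model_push(sk);
}

/* release_sock() side: pick up the deferred work exactly once. */
static void model_release_cb(struct tsq_model *sk)
{
	if (sk->tsq_flags & (1UL << MODEL_TSQ_OWNED)) {
		sk->tsq_flags &= ~(1UL << MODEL_TSQ_OWNED);
		model_push(sk);
	}
}
```

In this model the tasklet never transmits on a socket that is owned by the user; the work runs once from the release callback and the flag is cleared so it does not repeat.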

[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
  but some drivers call it in their start_xmit() handler.
  These drivers should at least use BQL, or else a single TCP
  session can still fill the whole NIC TX ring, since TSQ will
  have no effect.
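
For checking the tunable from userspace, a small reader is enough. This is a sketch that assumes the file holds a single decimal integer; parse_limit and read_limit are hypothetical helper names.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a non-negative decimal value; returns 0 on success, -1 on error. */
static int parse_limit(const char *text, long *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(text, &end, 10);
	if (errno != 0 || end == text || v < 0)
		return -1;
	*out = v;
	return 0;
}

/* Read the first line of a sysctl file and parse it. */
static int read_limit(const char *path, long *out)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_limit(buf, out);
}
```

Calling read_limit("/proc/sys/net/ipv4/tcp_limit_output_bytes", &v) on a kernel carrying this patch should fill v with the current limit; on kernels without the tunable the call returns -1.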

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
parent 2100844c
+14 −0
@@ -551,6 +551,20 @@ tcp_thin_dupack - BOOLEAN
	Documentation/networking/tcp-thin.txt
	Default: 0

tcp_limit_output_bytes - INTEGER
	Controls TCP Small Queue limit per tcp socket.
	TCP bulk sender tends to increase packets in flight until it
	gets losses notifications. With SNDBUF autotuning, this can
	result in a large amount of packets queued in qdisc/device
	on the local machine, hurting latency of other flows, for
	typical pfifo_fast qdiscs.
	tcp_limit_output_bytes limits the number of bytes on qdisc
	or device to reduce artificial RTT/cwnd and reduce bufferbloat.
	Note: For GSO/TSO enabled flows, we try to have at least two
	packets in flight. Reducing tcp_limit_output_bytes might also
	reduce the size of individual GSO packet (64KB being the max)
	Default: 131072

UDP variables:

udp_mem - vector of 3 INTEGERs: min, pressure, max
+9 −0
@@ -339,6 +339,9 @@ struct tcp_sock {
	u32	rcv_tstamp;	/* timestamp of last received ACK (for keepalives) */
	u32	lsndtime;	/* timestamp of last sent data packet (for restart window) */

	struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
	unsigned long	tsq_flags;

	/* Data for direct copy to user */
	struct {
		struct sk_buff_head	prequeue;
@@ -494,6 +497,12 @@ struct tcp_sock {
	struct tcp_cookie_values  *cookie_values;
};

enum tsq_flags {
	TSQ_THROTTLED,
	TSQ_QUEUED,
	TSQ_OWNED, /* tcp_tasklet_func() found socket was locked */
};

static inline struct tcp_sock *tcp_sk(const struct sock *sk)
{
	return (struct tcp_sock *)sk;
+2 −0
@@ -858,6 +858,8 @@ struct proto {
	int			(*backlog_rcv) (struct sock *sk,
						struct sk_buff *skb);

	void		(*release_cb)(struct sock *sk);

	/* Keeping track of sk's, looking them up, and port selection methods. */
	void			(*hash)(struct sock *sk);
	void			(*unhash)(struct sock *sk);
+4 −0
@@ -253,6 +253,7 @@ extern int sysctl_tcp_cookie_size;
extern int sysctl_tcp_thin_linear_timeouts;
extern int sysctl_tcp_thin_dupack;
extern int sysctl_tcp_early_retrans;
extern int sysctl_tcp_limit_output_bytes;

extern atomic_long_t tcp_memory_allocated;
extern struct percpu_counter tcp_sockets_allocated;
@@ -321,6 +322,8 @@ extern struct proto tcp_prot;

extern void tcp_init_mem(struct net *net);

extern void tcp_tasklet_init(void);

extern void tcp_v4_err(struct sk_buff *skb, u32);

extern void tcp_shutdown (struct sock *sk, int how);
@@ -334,6 +337,7 @@ extern int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
		       size_t size);
extern int tcp_sendpage(struct sock *sk, struct page *page, int offset,
			size_t size, int flags);
extern void tcp_release_cb(struct sock *sk);
extern int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg);
extern int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
				 const struct tcphdr *th, unsigned int len);
+4 −0
@@ -2159,6 +2159,10 @@ void release_sock(struct sock *sk)
	spin_lock_bh(&sk->sk_lock.slock);
	if (sk->sk_backlog.tail)
		__release_sock(sk);

	if (sk->sk_prot->release_cb)
		sk->sk_prot->release_cb(sk);

	sk->sk_lock.owned = 0;
	if (waitqueue_active(&sk->sk_lock.wq))
		wake_up(&sk->sk_lock.wq);