Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit a88e24f2 authored by David S. Miller's avatar David S. Miller
Browse files

Merge branch 'tcp-switch-to-Early-Departure-Time-model'

Eric Dumazet says:

====================
tcp: switch to Early Departure Time model

In the early days, pacing has been implemented in sch_fq (FQ)
in a generic way :

- SO_MAX_PACING_RATE could be used by any sockets.

- TCP would vary effective pacing rate based on CWND*MSS/SRTT

- FQ would ensure delays between packets based on current
  sk->sk_pacing_rate, but with some quantum based artifacts.
  (inflating RPC tail latencies)

- BBR then tweaked the pacing rate in its various phases
  (PROBE, DRAIN, ...)

This worked reasonably well, but had the side effect that TCP RTT
samples would be inflated by the sojourn time of the packets in FQ.

Also note that when FQ is not used and TCP wants pacing, the
internal pacing fallback has very different behavior, since TCP
emits packets at the time they should be sent (with unreasonable
assumptions about scheduling costs)

Van Jacobson gave a talk at Netdev 0x12 in Montreal, about letting
TCP (or applications for UDP messages) decide of the Earliest
Departure Time, instead of letting packet schedulers derive it
from pacing rate.

https://www.netdevconf.org/0x12/session.html?evolving-from-afap-teaching-nics-about-time
https://www.files.netdevconf.org/d/46def75c2ef345809bbe/files/?p=/Evolving%20from%20AFAP%20%E2%80%93%20Teaching%20NICs%20about%20time.pdf



Recent additions in linux provided SO_TXTIME and a new ETF qdisc
supporting the new skb->tstamp role

This patch series converts TCP and FQ to the same model.

This might in the future allow us to relax tight TSQ limits
(if FQ is present in the output path), and thus lower
number of callbacks to tcp_write_xmit(), thanks to batching.

This will be followed by FQ change allowing SO_TXTIME support
so that QUIC servers can let the pacing being done in FQ (or
offloaded if network device permits)

For example, a TCP flow rated at 24Mbps now shows a more meaningful RTT

Before :

ESTAB  0  211408 10.246.7.151:41558   10.246.7.152:33723
	 cubic wscale:8,8 rto:203 rtt:2.195/0.084 mss:1448 rcvmss:536
  advmss:1448 cwnd:20 ssthresh:20 bytes_acked:36897937
  segs_out:25488 segs_in:12454 data_segs_out:25486
  send 105.5Mbps lastsnd:1 lastrcv:12851 lastack:1
  pacing_rate 24.0Mbps/24.0Mbps delivery_rate 22.9Mbps
  busy:12851ms unacked:4 rcv_space:29200 notsent:205616 minrtt:0.026

After :

ESTAB  0  192584 10.246.7.151:61612   10.246.7.152:34375
	 cubic wscale:8,8 rto:201 rtt:0.165/0.129 mss:1448 rcvmss:536
  advmss:1448 cwnd:20 ssthresh:20 bytes_acked:170755401
  segs_out:117931 segs_in:57651 data_segs_out:117929
  send 1404.1Mbps lastsnd:1 lastrcv:56915 lastack:1
  pacing_rate 24.0Mbps/24.0Mbps delivery_rate 24.2Mbps
  busy:56915ms unacked:4 rcv_space:29200 notsent:186792 minrtt:0.054

A nice side effect of this patch series is a reduction of max/p99
latencies of RPC workloads, since the FQ quantum no longer adds
artifact.
====================

Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
parents 4f4b93a8 90caf67b
Loading
Loading
Loading
Loading
+1 −1
Original line number Diff line number Diff line
@@ -689,7 +689,7 @@ struct sk_buff {

	union {
		ktime_t		tstamp;
		u64		skb_mstamp;
		u64		skb_mstamp_ns; /* earliest departure time */
	};
	/*
	 * This is the control buffer. It is free to use for every
+2 −0
Original line number Diff line number Diff line
@@ -248,6 +248,8 @@ struct tcp_sock {
		syn_smc:1;	/* SYN includes SMC */
	u32	tlp_high_seq;	/* snd_nxt at the time of TLP retransmit. */

	u64	tcp_wstamp_ns;	/* departure time for next sent data packet */

/* RTT measurement */
	u64	tcp_mstamp;	/* most recent packet received/sent */
	u32	srtt_us;	/* smoothed round trip time << 3 in usecs */
+11 −15
Original line number Diff line number Diff line
@@ -732,7 +732,7 @@ void tcp_send_window_probe(struct sock *sk);

static inline u64 tcp_clock_ns(void)
{
	return local_clock();
	return ktime_get_tai_ns();
}

static inline u64 tcp_clock_us(void)
@@ -752,17 +752,7 @@ static inline u32 tcp_time_stamp_raw(void)
	return div_u64(tcp_clock_ns(), NSEC_PER_SEC / TCP_TS_HZ);
}


/* Refresh 1us clock of a TCP socket,
 * ensuring monotically increasing values.
 */
static inline void tcp_mstamp_refresh(struct tcp_sock *tp)
{
	u64 val = tcp_clock_us();

	if (val > tp->tcp_mstamp)
		tp->tcp_mstamp = val;
}
void tcp_mstamp_refresh(struct tcp_sock *tp);

static inline u32 tcp_stamp_us_delta(u64 t1, u64 t0)
{
@@ -771,7 +761,13 @@ static inline u32 tcp_stamp_us_delta(u64 t1, u64 t0)

static inline u32 tcp_skb_timestamp(const struct sk_buff *skb)
{
	return div_u64(skb->skb_mstamp, USEC_PER_SEC / TCP_TS_HZ);
	return div_u64(skb->skb_mstamp_ns, NSEC_PER_SEC / TCP_TS_HZ);
}

/* provide the departure time in us unit */
static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb)
{
	return div_u64(skb->skb_mstamp_ns, NSEC_PER_USEC);
}


@@ -817,7 +813,7 @@ struct tcp_skb_cb {
#define TCPCB_SACKED_RETRANS	0x02	/* SKB retransmitted		*/
#define TCPCB_LOST		0x04	/* SKB is lost			*/
#define TCPCB_TAGBITS		0x07	/* All tag bits			*/
#define TCPCB_REPAIRED		0x10	/* SKB repaired (no skb_mstamp)	*/
#define TCPCB_REPAIRED		0x10	/* SKB repaired (no skb_mstamp_ns)	*/
#define TCPCB_EVER_RETRANS	0x80	/* Ever retransmitted frame	*/
#define TCPCB_RETRANS		(TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS| \
				TCPCB_REPAIRED)
@@ -1940,7 +1936,7 @@ static inline s64 tcp_rto_delta_us(const struct sock *sk)
{
	const struct sk_buff *skb = tcp_rtx_queue_head(sk);
	u32 rto = inet_csk(sk)->icsk_rto;
	u64 rto_time_stamp_us = skb->skb_mstamp + jiffies_to_usecs(rto);
	u64 rto_time_stamp_us = tcp_skb_timestamp_us(skb) + jiffies_to_usecs(rto);

	return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp;
}
+1 −1
Original line number Diff line number Diff line
@@ -88,7 +88,7 @@ u64 cookie_init_timestamp(struct request_sock *req)
		ts <<= TSBITS;
		ts |= options;
	}
	return (u64)ts * (USEC_PER_SEC / TCP_TS_HZ);
	return (u64)ts * (NSEC_PER_SEC / TCP_TS_HZ);
}


+1 −1
Original line number Diff line number Diff line
@@ -1295,7 +1295,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
			copy = size_goal;

			/* All packets are restored as if they have
			 * already been sent. skb_mstamp isn't set to
			 * already been sent. skb_mstamp_ns isn't set to
			 * avoid wrong rtt estimation.
			 */
			if (tp->repair)
Loading