Commit 7a7d1d57 authored by David S. Miller

Merge branch 'rds-enable-mprds'



Sowmini Varadhan says:

====================
RDS: TCP: Enable mprds for rds-tcp

The third, and final, installment for mprds-tcp changes.

In Patch 3 of this set, if the transport supports t_mp_capable,
we hash outgoing traffic across multiple paths.  Additionally, even if
the transport is MP capable, we may be peering with some node that does
not support mprds, or supports a different number of paths. This
necessitates RDS control plane changes so that both peers agree
on the number of paths to be used for the rds-tcp connection.
Patch 3 implements all these changes, which are documented in patch 5
of the series.

Patch 1 of this series is a bug fix for a race-condition
that has always existed, but is now more easily encountered with mprds.
Patch 2 is code refactoring. Patches 4 and 5 are Documentation updates.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
parents caeccd51 09204a6c
+71 −1
@@ -85,7 +85,8 @@ Socket Interface

   bind(fd, &sockaddr_in, ...)
         This binds the socket to a local IP address and port, and a
-        transport.
+        transport, if one has not already been selected via the
+	SO_RDS_TRANSPORT socket option

   sendmsg(fd, ...)
         Sends a message to the indicated recipient. The kernel will
@@ -146,6 +147,20 @@ Socket Interface
         operation. In this case, it would use RDS_CANCEL_SENT_TO to
         nuke any pending messages.

+  setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
+  getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
+	Set or read an integer defining the underlying
+	encapsulating transport to be used for RDS packets on the
+	socket. When setting the option, the integer argument may be
+	one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
+	value, RDS_TRANS_NONE will be returned on an unbound socket.
+	This socket option may be set only once on the socket,
+	prior to binding it via the bind(2) system call. Attempts to
+	set SO_RDS_TRANSPORT on a socket for which the transport has
+	been previously attached explicitly (by SO_RDS_TRANSPORT) or
+	implicitly (via bind(2)) will return an error of EOPNOTSUPP.
+	An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
+	always return EINVAL.

 RDMA for RDS
 ============
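The SO_RDS_TRANSPORT text in the hunk above says the option must be set once, before bind(2). A minimal user-space sketch of that ordering follows. The helper name `rds_socket_with_transport` is hypothetical, and the fallback constant values are assumed to mirror `<linux/rds.h>`; prefer including the real header where it is available.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fallbacks for build hosts whose headers lack the RDS definitions.
 * These values are assumptions mirroring <linux/rds.h>. */
#ifndef PF_RDS
#define PF_RDS 21
#endif
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef SO_RDS_TRANSPORT
#define SO_RDS_TRANSPORT 8
#endif
#ifndef RDS_TRANS_TCP
#define RDS_TRANS_TCP 2
#endif

/* Hypothetical helper: create a PF_RDS socket and select its transport
 * before any bind(2) call. Returns the fd, or -1 with errno set. */
static int rds_socket_with_transport(int transport)
{
	int fd = socket(PF_RDS, SOCK_SEQPACKET, 0);

	if (fd < 0)
		return -1;	/* e.g. RDS support is not loaded */

	if (setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
		       &transport, sizeof(transport)) < 0) {
		int saved = errno;	/* keep the setsockopt error */

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;	/* caller may now bind(2) the socket */
}
```

A caller would pass RDS_TRANS_TCP here and then bind the returned descriptor. Per the text above, a second attempt to set the option on the same socket is expected to fail with EOPNOTSUPP.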
@@ -350,4 +365,59 @@ The recv path
     handle CMSGs
     return to application

+Multipath RDS (mprds)
+=====================
+  Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
+  (though the concept can be extended to other transports). The classical
+  implementation of RDS-over-TCP works by demultiplexing multiple
+  PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
+  port]) over a single TCP socket between the 2 IP addresses involved. This
+  has the limitation that it ends up funneling multiple RDS flows over a
+  single TCP flow, thus it
+  (a) is upper-bounded to the single-flow bandwidth,
+  (b) suffers from head-of-line blocking for all the RDS sockets.
+
+  Better throughput (for a fixed small packet size, MTU) can be achieved
+  by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
+  RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
+  connection. RDS sockets will be attached to a path based on some hash
+  (e.g., of local address and RDS port number) and packets for that RDS
+  socket will be sent over the attached path using TCP to segment/reassemble
+  RDS datagrams on that path.
+
+  Multipathed RDS is implemented by splitting the struct rds_connection into
+  a common (to all paths) part, and a per-path struct rds_conn_path. All
+  I/O workqs and reconnect threads are driven from the rds_conn_path.
+  Transports such as TCP that are multipath capable may then set up a
+  TCP socket per rds_conn_path, and this is managed by the transport via
+  the transport private cp_transport_data pointer.
+
+  Transports announce themselves as multipath capable by setting the
+  t_mp_capable bit during registration with the rds core module. When the
+  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
+  across multiple paths. The outgoing hash is computed based on the
+  local address and port that the PF_RDS socket is bound to.
+
+  Additionally, even if the transport is MP capable, we may be
+  peering with some node that does not support mprds, or supports
+  a different number of paths. As a result, the peering nodes need
+  to agree on the number of paths to be used for the connection.
+  This is done by sending out a control packet exchange before the
+  first data packet. The control packet exchange must have completed
+  prior to outgoing hash completion in rds_sendmsg() when the transport
+  is multipath capable.
+
+  The control packet is an RDS ping packet (i.e., a packet to rds dest
+  port 0) carrying an rds extension header option of
+  type RDS_EXTHDR_NPATHS, length 2 bytes, whose value is the
+  number of paths supported by the sender. The "probe" ping packet will
+  get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
+  The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
+  be able to compute the min(sender_paths, rcvr_paths). The pong
+  sent in response to a probe-ping should contain the rcvr's npaths
+  when the rcvr is mprds-capable.
+
+  If the rcvr is not mprds-capable, the exthdr in the ping will be
+  ignored.  In this case the pong will not have any exthdrs, so the sender
+  of the probe-ping can default to single-path mprds.
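The path-count agreement described in the documentation hunk above can be sketched as a small pure function. This is only an illustration of the min(sender_paths, rcvr_paths) rule and the single-path fallback, not the kernel's code. The name `agree_npaths` and its parameters are hypothetical, and treating an advertised value of 0 as "single path" is an assumption made here for safety.

```c
#include <stdint.h>

/* Hypothetical helper illustrating the negotiation rule.
 *
 * local_npaths:     number of paths this node supports (>= 1)
 * peer_has_exthdr:  non-zero if the peer's ping/pong carried an
 *                   RDS_EXTHDR_NPATHS extension header
 * peer_npaths:      value from that header (ignored if absent)
 */
static unsigned int agree_npaths(unsigned int local_npaths,
				 int peer_has_exthdr,
				 uint16_t peer_npaths)
{
	/* Legacy peer: no exthdr in the pong, so fall back to one path.
	 * Assumption: an advertised 0 is treated the same way. */
	if (!peer_has_exthdr || peer_npaths == 0)
		return 1;

	/* Both sides are mprds-capable: use the smaller path count. */
	return local_npaths < peer_npaths ? local_npaths : peer_npaths;
}
```

For example, a node supporting 8 paths that receives an advertisement of 4 would settle on 4, while the same node talking to a legacy peer would use a single path.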
+6 −0
@@ -81,6 +81,8 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)

 	if (*port != 0) {
 		rover = be16_to_cpu(*port);
+		if (rover == RDS_FLAG_PROBE_PORT)
+			return -EINVAL;
 		last = rover;
 	} else {
 		rover = max_t(u16, prandom_u32(), 2);
@@ -91,12 +93,16 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
 		if (rover == 0)
 			rover++;

+		if (rover == RDS_FLAG_PROBE_PORT)
+			continue;
 		key = ((u64)addr << 32) | cpu_to_be16(rover);
 		if (rhashtable_lookup_fast(&bind_hash_table, &key, ht_parms))
 			continue;

 		rs->rs_bound_key = key;
 		rs->rs_bound_addr = addr;
+		net_get_random_once(&rs->rs_hash_initval,
+				    sizeof(rs->rs_hash_initval));
 		rs->rs_bound_port = cpu_to_be16(rover);
 		rs->rs_bound_node.next = NULL;
 		rds_sock_addref(rs);
+8 −9
@@ -155,7 +155,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	struct hlist_head *head = rds_conn_bucket(laddr, faddr);
 	struct rds_transport *loop_trans;
 	unsigned long flags;
-	int ret;
+	int ret, i;

 	rcu_read_lock();
 	conn = rds_conn_lookup(net, head, laddr, faddr, trans);
@@ -211,6 +211,12 @@ static struct rds_connection *__rds_conn_create(struct net *net,

 	conn->c_trans = trans;

+	init_waitqueue_head(&conn->c_hs_waitq);
+	for (i = 0; i < RDS_MPATH_WORKERS; i++) {
+		__rds_conn_path_init(conn, &conn->c_path[i],
+				     is_outgoing);
+		conn->c_path[i].cp_index = i;
+	}
 	ret = trans->conn_alloc(conn, gfp);
 	if (ret) {
 		kmem_cache_free(rds_conn_slab, conn);
@@ -263,14 +269,6 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 			kmem_cache_free(rds_conn_slab, conn);
 			conn = found;
 		} else {
-			int i;
-
-			for (i = 0; i < RDS_MPATH_WORKERS; i++) {
-				__rds_conn_path_init(conn, &conn->c_path[i],
-						     is_outgoing);
-				conn->c_path[i].cp_index = i;
-			}
-
 			hlist_add_head_rcu(&conn->c_hash_node, head);
 			rds_cong_add_conn(conn);
 			rds_conn_count++;
@@ -668,6 +666,7 @@ EXPORT_SYMBOL_GPL(rds_conn_path_drop);

 void rds_conn_drop(struct rds_connection *conn)
 {
+	WARN_ON(conn->c_trans->t_mp_capable);
 	rds_conn_path_drop(&conn->c_path[0]);
 }
 EXPORT_SYMBOL_GPL(rds_conn_drop);
+1 −0
@@ -41,6 +41,7 @@ static unsigned int rds_exthdr_size[__RDS_EXTHDR_MAX] = {
 [RDS_EXTHDR_VERSION]	= sizeof(struct rds_ext_header_version),
 [RDS_EXTHDR_RDMA]	= sizeof(struct rds_ext_header_rdma),
 [RDS_EXTHDR_RDMA_DEST]	= sizeof(struct rds_ext_header_rdma_dest),
+[RDS_EXTHDR_NPATHS]	= sizeof(u16),
 };


+23 −2
@@ -85,7 +85,9 @@ enum {
 #define RDS_RECV_REFILL		3

 /* Max number of multipaths per RDS connection. Must be a power of 2 */
-#define	RDS_MPATH_WORKERS	1
+#define	RDS_MPATH_WORKERS	8
+#define	RDS_MPATH_HASH(rs, n) (jhash_1word((rs)->rs_bound_port, \
+			       (rs)->rs_hash_initval) & ((n) - 1))

 /* Per mpath connection state */
 struct rds_conn_path {
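The RDS_MPATH_HASH macro in the hunk above picks a path by hashing the bound port with a per-socket random seed and masking with `n - 1`, which is why the path count must be a power of 2. A user-space sketch of that selection step follows. The mixer `mix32` is a stand-in chosen for illustration and is not the kernel's jhash_1word(); `mpath_index` and its parameters are likewise hypothetical names.

```c
#include <stdint.h>

/* Stand-in 32-bit mixer. NOT the kernel's jhash_1word(); any hash
 * that spreads its input bits reasonably would serve the sketch. */
static uint32_t mix32(uint32_t word, uint32_t seed)
{
	uint32_t h = word ^ seed;

	h ^= h >> 16;
	h *= 0x85ebca6bu;
	h ^= h >> 13;
	h *= 0xc2b2ae35u;
	h ^= h >> 16;
	return h;
}

/* Hypothetical helper: map a socket's bound port and per-socket seed
 * to a path index in [0, npaths). Returns -1 if npaths is not a power
 * of 2, because the mask (npaths - 1) only works for powers of 2. */
static int mpath_index(uint16_t bound_port, uint32_t initval,
		       uint32_t npaths)
{
	if (npaths == 0 || (npaths & (npaths - 1)) != 0)
		return -1;

	return (int)(mix32(bound_port, initval) & (npaths - 1));
}
```

Because the inputs are fixed once a socket is bound, the same socket always maps to the same path, which keeps its datagrams on one TCP flow.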
@@ -131,7 +133,8 @@ struct rds_connection {
 	__be32			c_laddr;
 	__be32			c_faddr;
 	unsigned int		c_loopback:1,
-				c_pad_to_32:31;
+				c_ping_triggered:1,
+				c_pad_to_32:30;
 	int			c_npaths;
 	struct rds_connection	*c_passive;
 	struct rds_transport	*c_trans;
@@ -147,6 +150,7 @@ struct rds_connection {
 	unsigned long		c_map_queued;

 	struct rds_conn_path	c_path[RDS_MPATH_WORKERS];
+	wait_queue_head_t	c_hs_waitq; /* handshake waitq */
 };

 static inline
@@ -166,6 +170,17 @@ void rds_conn_net_set(struct rds_connection *conn, struct net *net)
 #define RDS_FLAG_RETRANSMITTED	0x04
 #define RDS_MAX_ADV_CREDIT	255

+/* RDS_FLAG_PROBE_PORT is the reserved sport used for sending a ping
+ * probe to exchange control information before establishing a connection.
+ * Currently the control information that is exchanged is the number of
+ * supported paths. If the peer is a legacy (older kernel revision) peer,
+ * it would return a pong message without additional control information
+ * that would then alert the sender that the peer was an older rev.
+ */
+#define RDS_FLAG_PROBE_PORT	1
+#define	RDS_HS_PROBE(sport, dport) \
+		((sport == RDS_FLAG_PROBE_PORT && dport == 0) || \
+		 (sport == 0 && dport == RDS_FLAG_PROBE_PORT))
 /*
  * Maximum space available for extension headers.
  */
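The RDS_HS_PROBE macro in the hunk above recognizes both directions of the handshake: a ping sent from the probe port to port 0, and the pong going back from port 0 to the probe port. The same predicate can be sketched as an inline function, which avoids the macro's unparenthesized arguments. The names `PROBE_PORT` and `is_hs_probe` are hypothetical, and the sketch assumes both ports are already in host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the value of RDS_FLAG_PROBE_PORT shown in the hunk above. */
#define PROBE_PORT 1

/* Hypothetical helper: true for a handshake ping (probe port -> 0)
 * or the matching pong (0 -> probe port). Ports are assumed to be in
 * host byte order here. */
static inline bool is_hs_probe(uint16_t sport, uint16_t dport)
{
	return (sport == PROBE_PORT && dport == 0) ||
	       (sport == 0 && dport == PROBE_PORT);
}
```

Ordinary traffic between two bound sockets never matches, since the bind path shown earlier in this commit refuses to hand out the probe port.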
@@ -225,6 +240,11 @@ struct rds_ext_header_rdma_dest {
 	__be32			h_rdma_offset;
 };

+/* Extension header announcing number of paths.
+ * Implicit length = 2 bytes.
+ */
+#define RDS_EXTHDR_NPATHS	4
+
 #define __RDS_EXTHDR_MAX	16 /* for now */

 struct rds_incoming {
@@ -545,6 +565,7 @@ struct rds_sock {
 	/* Socket options - in case there will be more */
 	unsigned char		rs_recverr,
 				rs_cong_monitor;
+	u32			rs_hash_initval;
 };

 static inline struct rds_sock *rds_sk_to_rs(const struct sock *sk)