
Commit 09204a6c authored by Sowmini Varadhan, committed by David S. Miller

Documentation: RDS: Document Multipath RDS (mprds)



Document the design of mprds, covering a brief description
of the motivation, data-structures and modifications to the
RDS control plane.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
parent d67214a2
+55 −0
@@ -365,4 +365,59 @@ The recv path
    handle CMSGs
    return to application

Multipath RDS (mprds)
=====================
  Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
  (though the concept can be extended to other transports). The classical
  implementation of RDS-over-TCP multiplexes multiple PF_RDS sockets
  between any 2 endpoints (where endpoint == [IP address, port]) over a
  single TCP socket between the 2 IP addresses involved. This has the
  limitation that it ends up funneling multiple RDS flows over a single
  TCP flow, which
  (a) is upper-bounded to the single-flow bandwidth, and
  (b) suffers from head-of-line blocking for all the RDS sockets.

  Better throughput (for a fixed small packet size, MTU) can be achieved
  by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
  RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
  connection. RDS sockets will be attached to a path based on some hash
  (e.g., of local address and RDS port number) and packets for that RDS
  socket will be sent over the attached path using TCP to segment/reassemble
  RDS datagrams on that path.

  Multipathed RDS is implemented by splitting the struct rds_connection into
  a common (to all paths) part, and a per-path struct rds_conn_path. All
  I/O workqs and reconnect threads are driven from the rds_conn_path.
  Transports such as TCP that are multipath capable may then set up a
  TCP socket per rds_conn_path, and this is managed by the transport via
  the transport private cp_transport_data pointer.
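
  The split described above can be pictured with the following sketch.
  It is not the kernel's definition: apart from cp_transport_data, every
  name here (the sketch_ prefix, cp_conn, cp_index, c_npaths, c_path and
  SKETCH_MAX_PATHS) is an assumption made for illustration.

```c
#include <stddef.h>

#define SKETCH_MAX_PATHS 8	/* hypothetical upper bound on paths */

struct sketch_connection;

/* Per-path state: one instance per TCP flow of the connection. */
struct sketch_conn_path {
	struct sketch_connection *cp_conn;	/* back-pointer to the shared part */
	unsigned int cp_index;			/* slot of this path in c_path[] */
	void *cp_transport_data;		/* transport private, e.g. a TCP socket */
};

/* State that is common to all paths of one connection. */
struct sketch_connection {
	unsigned int c_npaths;			/* number of paths in use */
	struct sketch_conn_path c_path[SKETCH_MAX_PATHS];
};

/* Initialize the shared part and wire every path back to it. */
static void sketch_conn_init(struct sketch_connection *conn,
			     unsigned int npaths)
{
	unsigned int i;

	if (npaths == 0)
		npaths = 1;
	if (npaths > SKETCH_MAX_PATHS)
		npaths = SKETCH_MAX_PATHS;
	conn->c_npaths = npaths;

	for (i = 0; i < SKETCH_MAX_PATHS; i++) {
		conn->c_path[i].cp_conn = conn;
		conn->c_path[i].cp_index = i;
		conn->c_path[i].cp_transport_data = NULL;
	}
}
```

  In this sketch a multipath capable transport would store its per-path
  socket in cp_transport_data, while a single-path transport would only
  ever use c_path[0].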

  Transports announce themselves as multipath capable by setting the
  t_mp_capable bit during registration with the rds core module. When the
  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
  across multiple paths. The outgoing hash is computed based on the
  local address and port that the PF_RDS socket is bound to.
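
  A minimal sketch of such a path selection follows. The mixing function
  is a generic integer hash chosen for illustration, not the hash used by
  the rds module, and the names sketch_mix32() and sketch_select_path()
  are hypothetical. The only properties that matter are that the same
  bound (address, port) pair always maps to the same path and that the
  result is always a valid path index.

```c
#include <stdint.h>

/* Generic 32-bit integer mixer (an assumption, not the kernel's hash). */
static uint32_t sketch_mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x45d9f3bU;
	x ^= x >> 16;
	x *= 0x45d9f3bU;
	x ^= x >> 16;
	return x;
}

/*
 * Map the local address and port that a PF_RDS socket is bound to onto
 * a path index in the range [0, npaths).
 */
static unsigned int sketch_select_path(uint32_t bound_addr,
				       uint16_t bound_port,
				       unsigned int npaths)
{
	if (npaths <= 1)
		return 0;	/* single-path connection */
	return sketch_mix32(bound_addr ^ bound_port) % npaths;
}
```

  Because the index depends only on the bound address and port, all
  datagrams from one socket stay on one path, so ordering within that
  socket is preserved by the underlying TCP flow.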

  Additionally, even if the transport is MP capable, we may be
  peering with some node that does not support mprds, or supports
  a different number of paths. As a result, the peering nodes need
  to agree on the number of paths to be used for the connection.
  This is done by sending out a control packet exchange before the
  first data packet. The control packet exchange must have completed
  prior to outgoing hash computation in rds_sendmsg() when the transport
  is multipath capable.

  The control packet is an RDS ping packet (i.e., packet to rds dest
  port 0) carrying an RDS extension header option of type
  RDS_EXTHDR_NPATHS, length 2 bytes, whose value is the number of paths
  supported by the sender. The "probe" ping packet will get sent from
  some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>). The
  receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
  be able to compute the min(sender_paths, rcvr_paths). The pong
  sent in response to a probe-ping should contain the rcvr's npaths
  when the rcvr is mprds-capable.
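
  The 2-byte npaths option can be illustrated with the encode/decode
  sketch below. The text above only fixes the option type name and the
  2-byte length. The numeric type value, the single leading type byte,
  the big-endian byte order and the sketch_ function names are all
  assumptions here, so consult the rds headers for the actual layout.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EXTHDR_NPATHS 5	/* placeholder value, an assumption */

/*
 * Write [type][npaths high byte][npaths low byte] into buf.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
static size_t sketch_put_npaths(uint8_t *buf, size_t len, uint16_t npaths)
{
	if (len < 3)
		return 0;
	buf[0] = SKETCH_EXTHDR_NPATHS;
	buf[1] = (uint8_t)(npaths >> 8);
	buf[2] = (uint8_t)(npaths & 0xff);
	return 3;
}

/*
 * Read an npaths option from buf. Returns 1 and stores the value in
 * *npaths if the option is present, otherwise returns 0.
 */
static int sketch_get_npaths(const uint8_t *buf, size_t len,
			     uint16_t *npaths)
{
	if (len < 3 || buf[0] != SKETCH_EXTHDR_NPATHS)
		return 0;
	*npaths = (uint16_t)((buf[1] << 8) | buf[2]);
	return 1;
}
```

  A receiver that finds the option can compare the decoded value with its
  own path count, while one that finds no option treats the peer as
  single-path.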

  If the rcvr is not mprds-capable, the exthdr in the ping will be
  ignored.  In this case the pong will not have any exthdrs, so the sender
  of the probe-ping can default to single-path mprds.
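
  Putting the two cases together, the agreed path count could be derived
  as in the sketch below. The function name and its parameters are
  hypothetical. It only restates the rule from the text: take the
  minimum of both sides when the peer advertised a path count, and fall
  back to a single path otherwise.

```c
/*
 * local_npaths:     number of paths this node supports
 * peer_sent_exthdr: non-zero if the peer's ping/pong carried npaths
 * peer_npaths:      value advertised by the peer (ignored if not sent)
 */
static unsigned int sketch_negotiate_npaths(unsigned int local_npaths,
					    int peer_sent_exthdr,
					    unsigned int peer_npaths)
{
	/* Peer is not mprds-capable, or a count is unusable: one path. */
	if (!peer_sent_exthdr || peer_npaths == 0 || local_npaths == 0)
		return 1;

	/* min(sender_paths, rcvr_paths) */
	return local_npaths < peer_npaths ? local_npaths : peer_npaths;
}
```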