From ffcb9f6ebc6e2914ecbc27e964f7bef4730db642 Mon Sep 17 00:00:00 2001
From: Ben Pfaff
Date: Wed, 9 Jan 2013 14:10:46 -0800
Subject: [PATCH] dpif: Document.

Signed-off-by: Ben Pfaff
---
 lib/dpif.h | 307 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 305 insertions(+), 2 deletions(-)

diff --git a/lib/dpif.h b/lib/dpif.h
index 893338b2b..a478db2c4 100644
--- a/lib/dpif.h
+++ b/lib/dpif.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2008, 2009, 2010, 2011, 2012 Nicira, Inc.
+ * Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013 Nicira, Inc.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -14,7 +14,310 @@
  * limitations under the License.
  */
 
-
+/*
+ * dpif, the DataPath InterFace.
+ *
+ * In Open vSwitch terminology, a "datapath" is a flow-based software switch.
+ * A datapath has no intelligence of its own. Rather, it relies entirely on
+ * its client to set up flows. The datapath layer is core to the Open vSwitch
+ * software switch: one could say, without much exaggeration, that everything
+ * in ovs-vswitchd above dpif exists only to make the correct decisions
+ * interacting with dpif.
+ *
+ * Typically, the client of a datapath is the software switch module in
+ * "ovs-vswitchd", but other clients can be written. The "ovs-dpctl" utility
+ * is also a (simple) client.
+ *
+ *
+ * Overview
+ * ========
+ *
+ * The terms written in quotes below are defined in later sections.
+ *
+ * When a datapath "port" receives a packet, it extracts the headers (the
+ * "flow"). If the datapath's "flow table" contains a "flow entry" whose flow
+ * is the same as the packet's, then it executes the "actions" in the flow
+ * entry and increments the flow's statistics. If there is no matching flow
+ * entry, the datapath instead appends the packet to an "upcall" queue.
+ *
+ *
+ * Ports
+ * =====
+ *
+ * A datapath has a set of ports that are analogous to the ports on an Ethernet
+ * switch. At the datapath level, each port has the following information
+ * associated with it:
+ *
+ *    - A name, a short string that must be unique within the host. This is
+ *      typically a name that would be familiar to the system administrator,
+ *      e.g. "eth0" or "vif1.1", but it is otherwise arbitrary.
+ *
+ *    - A 32-bit port number that must be unique within the datapath but is
+ *      otherwise arbitrary. The port number is the most important identifier
+ *      for a port in the datapath interface.
+ *
+ *    - A type, a short string that identifies the kind of port. On a Linux
+ *      host, typical types are "system" (for a network device such as eth0),
+ *      "internal" (for a simulated port used to connect to the TCP/IP stack),
+ *      and "gre" (for a GRE tunnel).
+ *
+ *    - A Netlink PID (see "Upcall Queuing and Ordering" below).
+ *
+ * The dpif interface has functions for adding and deleting ports. When a
+ * datapath implements these (e.g. as the Linux and netdev datapaths do), then
+ * Open vSwitch's ovs-vswitchd daemon can directly control what ports are used
+ * for switching. Some datapaths might not implement them, or implement them
+ * with restrictions on the types of ports that can be added or removed
+ * (e.g. on ESX), on systems where port membership can only be changed by some
+ * external entity.
+ *
+ * Each datapath must have a port, sometimes called the "local port", whose
+ * name is the same as the datapath itself, with port number 0. The local port
+ * cannot be deleted.
+ *
+ * Ports are available as "struct netdev"s. To obtain a "struct netdev *" for
+ * a port named 'name' with type 'port_type', in a datapath of type
+ * 'datapath_type', call netdev_open(name, dpif_port_open_type(datapath_type,
+ * port_type)). The netdev can be used to get and set important data related to
+ * the port, such as:
+ *
+ *    - MTU (netdev_get_mtu(), netdev_set_mtu()).
+ *
+ *    - Ethernet address (netdev_get_etheraddr(), netdev_set_etheraddr()).
+ *
+ *    - Statistics such as the number of packets and bytes transmitted and
+ *      received (netdev_get_stats()).
+ *
+ *    - Carrier status (netdev_get_carrier()).
+ *
+ *    - Speed (netdev_get_features()).
+ *
+ *    - QoS queue configuration (netdev_get_queue(), netdev_set_queue() and
+ *      related functions.)
+ *
+ *    - Arbitrary port-specific configuration parameters (netdev_get_config(),
+ *      netdev_set_config()). An example of such a parameter is the IP
+ *      endpoint for a GRE tunnel.
+ *
+ *
+ * Flow Table
+ * ==========
+ *
+ * The flow table is a hash table of "flow entries". Each flow entry contains:
+ *
+ *    - A "flow", that is, a summary of the headers in an Ethernet packet. The
+ *      flow is the hash key and thus must be unique within the flow table.
+ *      Flows are fine-grained entities that include L2, L3, and L4 headers. A
+ *      single TCP connection consists of two flows, one in each direction.
+ *
+ *      In Open vSwitch userspace, "struct flow" is the typical way to describe
+ *      a flow, but the datapath interface uses a different data format to
+ *      allow ABI forward- and backward-compatibility. datapath/README
+ *      describes the rationale and design. Refer to OVS_KEY_ATTR_* and
+ *      "struct ovs_key_*" in include/linux/openvswitch.h for details.
+ *      lib/odp-util.h defines several functions for working with these flows.
+ *
+ *      (In case you are familiar with OpenFlow, datapath flows are analogous
+ *      to OpenFlow flow matches. The most important difference is that
+ *      OpenFlow allows fields to be wildcarded and prioritized, whereas a
+ *      datapath's flow table is a hash table so every flow must be
+ *      exact-match, thus without priorities.)
+ *
+ *    - A list of "actions" that tell the datapath what to do with packets
+ *      within a flow. Some examples of actions are OVS_ACTION_ATTR_OUTPUT,
+ *      which transmits the packet out a port, and OVS_ACTION_ATTR_SET, which
+ *      modifies packet headers. Refer to OVS_ACTION_ATTR_* and "struct
+ *      ovs_action_*" in include/linux/openvswitch.h for details.
+ *      lib/odp-util.h defines several functions for working with datapath
+ *      actions.
+ *
+ *      The actions list may be empty. This indicates that nothing should be
+ *      done to matching packets, that is, they should be dropped.
+ *
+ *      (In case you are familiar with OpenFlow, datapath actions are analogous
+ *      to OpenFlow actions.)
+ *
+ *    - Statistics: the number of packets and bytes that the flow has
+ *      processed, the last time that the flow processed a packet, and the
+ *      union of all the TCP flags in packets processed by the flow. (The
+ *      latter is 0 if the flow is not a TCP flow.)
+ *
+ * The datapath's client manages the flow table, primarily in reaction to
+ * "upcalls" (see below).
+ *
+ *
+ * Upcalls
+ * =======
+ *
+ * A datapath sometimes needs to notify its client that a packet was received.
+ * The datapath mechanism to do this is called an "upcall".
+ *
+ * Upcalls are used in two situations:
+ *
+ *    - When a packet is received, but there is no matching flow entry in its
+ *      flow table (a flow table "miss"), this causes an upcall of type
+ *      DPIF_UC_MISS. These are called "miss" upcalls.
+ *
+ *    - A datapath action of type OVS_ACTION_ATTR_USERSPACE causes an upcall of
+ *      type DPIF_UC_ACTION. These are called "action" upcalls.
+ *
+ * An upcall contains an entire packet. There is no attempt to, e.g., copy
+ * only as much of the packet as normally needed to make a forwarding decision.
+ * Such an optimization is doable, but experimental prototypes showed it to be
+ * of little benefit because an upcall typically contains the first packet of a
+ * flow, which is usually short (e.g. a TCP SYN). Also, the entire packet can
+ * sometimes really be needed.
+ *
+ * After a client reads a given upcall, the datapath is finished with it, that
+ * is, the datapath doesn't maintain any lingering state past that point.
+ *
+ * The latency from the time that a packet arrives at a port to the time that
+ * it is received from dpif_recv() is critical in some benchmarks. For
+ * example, if this latency is 1 ms, then a netperf TCP_CRR test, which opens
+ * and closes TCP connections one at a time as quickly as it can, cannot
+ * possibly achieve more than 500 transactions per second, since every
+ * connection consists of two flows with 1-ms latency to set up each one.
+ *
+ * To receive upcalls, a client has to enable them with dpif_recv_set(). A
+ * datapath should generally support multiple clients at once (e.g. so that one
+ * may run "ovs-dpctl show" or "ovs-dpctl dump-flows" while "ovs-vswitchd" is
+ * also running) but need not support multiple clients enabling upcalls at
+ * once.
+ *
+ *
+ * Upcall Queuing and Ordering
+ * ---------------------------
+ *
+ * The datapath's client reads upcalls one at a time by calling dpif_recv().
+ * When more than one upcall is pending, the order in which the datapath
+ * presents upcalls to its client is important. The datapath's client does not
+ * directly control this order, so the datapath implementer must take care
+ * during design.
+ *
+ * The minimal behavior, suitable for initial testing of a datapath
+ * implementation, is that all upcalls are appended to a single queue, which is
+ * delivered to the client in order.
+ *
+ * The datapath should ensure that a high rate of upcalls from one particular
+ * port cannot cause upcalls from other sources to be dropped or unreasonably
+ * delayed. Otherwise, one port conducting a port scan or otherwise initiating
+ * high-rate traffic spanning many flows could suppress other traffic.
+ * Ideally, the datapath should present upcalls from each port in a "round
+ * robin" manner, to ensure fairness.
+ *
+ * The client has no control over "miss" upcalls and no insight into the
+ * datapath's implementation, so the datapath is entirely responsible for
+ * queuing and delivering them. On the other hand, the datapath has
+ * considerable freedom of implementation. One good approach is to maintain a
+ * separate queue for each port, to prevent any given port's upcalls from
+ * interfering with other ports' upcalls. If this is impractical, then another
+ * reasonable choice is to maintain some fixed number of queues and assign each
+ * port to one of them. Ports assigned to the same queue can then interfere
+ * with each other, but not with ports assigned to different queues. Other
+ * approaches are also possible.
+ *
+ * The client has some control over "action" upcalls: it can specify a 32-bit
+ * "Netlink PID" as part of the action. This terminology comes from the Linux
+ * datapath implementation, which uses a protocol called Netlink in which a PID
+ * designates a particular socket and the upcall data is delivered to the
+ * socket's receive queue. Generically, though, a Netlink PID identifies a
+ * queue for upcalls. The basic requirements on the datapath are:
+ *
+ *    - The datapath must provide a Netlink PID associated with each port. The
+ *      client can retrieve the PID with dpif_port_get_pid().
+ *
+ *    - The datapath must provide a "special" Netlink PID not associated with
+ *      any port. dpif_port_get_pid() also provides this PID. (ovs-vswitchd
+ *      uses this PID to queue special packets that must not be lost even if a
+ *      port is otherwise busy, such as packets used for tunnel monitoring.)
+ *
+ * The minimal behavior of dpif_port_get_pid() and the treatment of the Netlink
+ * PID in "action" upcalls is that dpif_port_get_pid() returns a constant value
+ * and all upcalls are appended to a single queue.
+ *
+ * The ideal behavior is:
+ *
+ *    - Each port has a PID that identifies the queue used for "miss" upcalls
+ *      on that port. (Thus, if each port has its own queue for "miss"
+ *      upcalls, then each port has a different Netlink PID.)
+ *
+ *    - "miss" upcalls for a given port and "action" upcalls that specify that
+ *      port's Netlink PID add their upcalls to the same queue. The upcalls
+ *      are delivered to the datapath's client in the order that the packets
+ *      were received, regardless of whether the upcalls are "miss" or "action"
+ *      upcalls.
+ *
+ *    - Upcalls that specify the "special" Netlink PID are queued separately.
+ *
+ *
+ * Packet Format
+ * =============
+ *
+ * The datapath interface works with packets in a particular form. This is the
+ * form taken by packets received via upcalls (i.e. by dpif_recv()). Packets
+ * supplied to the datapath for processing (i.e. to dpif_execute()) also take
+ * this form.
+ *
+ * A VLAN tag is represented by an 802.1Q header. If the layer below the
+ * datapath interface uses another representation, then the datapath interface
+ * must perform conversion.
+ *
+ * The datapath interface requires all packets to fit within the MTU. Some
+ * operating systems internally process packets larger than MTU, with features
+ * such as TSO and UFO. When such a packet passes through the datapath
+ * interface, it must be broken into multiple MTU or smaller sized packets for
+ * presentation as upcalls. (This does not happen often, because an upcall
+ * typically contains the first packet of a flow, which is usually short.)
+ *
+ * Some operating system TCP/IP stacks maintain packets in an unchecksummed or
+ * partially checksummed state until transmission. The datapath interface
+ * requires all host-generated packets to be fully checksummed (e.g. IP and TCP
+ * checksums must be correct). On such an OS, the datapath interface must fill
+ * in these checksums.
+ *
+ * Packets passed through the datapath interface must be at least 14 bytes
+ * long, that is, they must have a complete Ethernet header. They are not
+ * required to be padded to the minimum Ethernet length.
+ *
+ *
+ * Typical Usage
+ * =============
+ *
+ * Typically, the client of a datapath begins by configuring the datapath with
+ * a set of ports. Afterward, the client runs in a loop polling for upcalls to
+ * arrive.
+ *
+ * For each upcall received, the client examines the enclosed packet and
+ * figures out what should be done with it. For example, if the client
+ * implements a MAC-learning switch, then it searches the forwarding database
+ * for the packet's destination MAC and VLAN and determines the set of ports to
+ * which it should be sent. In any case, the client composes a set of datapath
+ * actions to properly dispatch the packet and then directs the datapath to
+ * execute those actions on the packet (e.g. with dpif_execute()).
+ *
+ * Most of the time, the actions that the client executed on the packet apply
+ * to every packet with the same flow. For example, the flow includes both
+ * destination MAC and VLAN ID (and much more), so this is true for the
+ * MAC-learning switch example above. In such a case, the client can also
+ * direct the datapath to treat any further packets in the flow in the same
+ * way, using dpif_flow_put() to add a new flow entry.
+ *
+ * Other tasks the client might need to perform, in addition to reacting to
+ * upcalls, include:
+ *
+ *    - Periodically polling flow statistics, perhaps to supply to its own
+ *      clients.
+ *
+ *    - Deleting flow entries from the datapath that haven't been used
+ *      recently, to save memory.
+ *
+ *    - Updating flow entries whose actions should change. For example, if a
+ *      MAC learning switch learns that a MAC has moved, then it must update
+ *      the actions of flow entries that sent packets to the MAC at its old
+ *      location.
+ *
+ *    - Adding and removing ports to achieve a new configuration.
+ */
 #ifndef DPIF_H
 #define DPIF_H 1
 
-- 
2.43.0