
Multipath Proposal

Forwarding Groups Proposal

"All problems in computer science can be solved by another level of indirection... Except for the problem of too many layers of indirection." [1]

The ability for one port to point to a group of other ports enables OpenFlow to represent additional methods of forwarding (e.g. multipath, fast reroute, and link aggregation), as well as to handle current methods (e.g. multicast) more efficiently.

Short Description

This proposal adds a layer of port indirection, starting with a new OpenFlow message, GROUP_MOD, to manage a logical port as a "group of action buckets." A number of group types are defined, each with different forwarding behavior. For example, multipath forwarding groups add the ability to send packets to one port out of a set of choices. If a flow entry forwards to a multipath logical port, the actions in one action bucket are applied to each matching packet. Another type is a fast reroute group; the conditions for when to transition within a fast reroute group are outside the OpenFlow spec. See below for others.
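
To make the indirection concrete, here is a minimal data-model sketch in C. All names and sizes are illustrative assumptions, not part of any OpenFlow release:

  #include <stdint.h>

  #define MAX_ACTIONS 4
  #define MAX_BUCKETS 16

  struct action_bucket {
      uint16_t weight;               /* share of traffic (multipath only) */
      uint16_t num_actions;
      uint32_t actions[MAX_ACTIONS]; /* encoded actions, e.g. forward-port */
  };

  struct group {
      uint32_t logical_port;         /* the ID that flow entries forward to */
      uint8_t  type;                 /* multipath, multicast, fast reroute, ... */
      uint16_t num_buckets;
      struct action_bucket buckets[MAX_BUCKETS];
  };

A flow entry keeps forwarding to the same logical_port; a GROUP_MOD that swaps the buckets changes the behavior of every such entry at once.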

Switches should support chained forwarding. Though this may complicate a HAL implementation, it provides the ability to efficiently change groups of forwarding entries that share the same behavior.

Use Cases

The primary motivating use case is L2/L3 multipath forwarding, but a generic forwarding group mechanism should support others.

  1. Link Aggregation (LAG): a.k.a. trunks, where multiple links are combined into one logical link.
  2. Equal-cost multi-path routing (ECMP): packets are spread across multiple equal-cost choices, where each choice leads to a different switch or router.
  3. Non-equal-cost multi-path routing: biased packet spreading, where each choice leads to a different switch or router.
  4. Fast Reroute: preferentially forward to one path, and when that path fails, quickly change forwarding to a backup path.
  5. Multi-homing: forwarding to one interface of a device that has multiple physical interfaces.
  6. Next-hop aggregation: in BGP, a group of peers may all share a single next hop. Atomic and efficient path changes to the shared next hop can reduce the re-convergence delay.

Note that these use cases can share a generic indirection primitive.

Design Space Discussion

The following discussion is NOT meant to present a specific proposal, but instead, to cover the design space. For the specific proposal, see the section "Specific Proposal", below.

This document originally described multipath forwarding only, but has since been generalized to forwarding groups. Hence, some of the description may read out of order.

Terminology

A multipath (MP) forwarding group consists of a group of action buckets. The phrase bucket is used here as a reminder that a hash on a packet selects a single bucket to use for forwarding the packet.

For an L2 LAG, each bucket consists of a forward-port action. The hash function to choose between buckets might be based on the L2 MAC destination field.

For an L3 ECMP group, each bucket contains a rewrite-destination-MAC action followed by a forward-port action. The hash function might be based on the IP 5-tuple (IP src, dst, proto and L4 src, dst ports).
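
As a sketch of how a switch might map a packet to a bucket, the following C code hashes the 5-tuple and takes the result modulo the bucket count, so packets of one flow always land in the same bucket (preserving per-flow ordering). The hash shown (FNV-1a) is purely illustrative; the proposal leaves the actual function unspecified:

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical 5-tuple; field names are illustrative. */
  struct five_tuple {
      uint32_t ip_src, ip_dst;
      uint8_t  ip_proto;
      uint16_t l4_src, l4_dst;
  };

  static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
  {
      const uint8_t *p = data;
      while (len--)
          h = (h ^ *p++) * 16777619u;
      return h;
  }

  /* num_buckets must be nonzero. */
  static unsigned pick_bucket(const struct five_tuple *t, unsigned num_buckets)
  {
      uint32_t h = 2166136261u;
      h = fnv1a(h, &t->ip_src, sizeof t->ip_src);
      h = fnv1a(h, &t->ip_dst, sizeof t->ip_dst);
      h = fnv1a(h, &t->ip_proto, sizeof t->ip_proto);
      h = fnv1a(h, &t->l4_src, sizeof t->l4_src);
      h = fnv1a(h, &t->l4_dst, sizeof t->l4_dst);
      return h % num_buckets;
  }

For an L2 LAG, the same structure applies with a hash over the MAC destination instead.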

Implicit vs Explicit Groups

Here's an example. Say source S wants to send to destination D via paths 1-2-4 and 1-3-4. Switch 1 has two outgoing ports, 0 and 1.

[Figure (Groups.png): explicit vs. implicit representations of the multipath groups in this example]

There are at least two main ways to represent this, shown above:

  1. explicit: MP groups are represented as logical ports
  2. implicit: MP forwarding is represented by flow entries with a new MP forwarding action

These methods differ in the interface between the controller and the switch, and the choice affects which side requires more implementation work. In the end, there has to be a hash table somewhere to convert from an MP group identifier (either a logical port or the explicit list of buckets) to an actual hardware group.

Arguments for (1) explicit MP groups:

  • The controller has flexibility over how to allocate and use groups, and thus may make better use of limited hardware group resources.
  • If multiple flow entries use the same multipath forwarding group as their destination, the forwarding group can be updated with a single explicit group message, instead of redefining implicit groups through multiple flow mod messages.
  • Provides a way to atomically change the MP forwarding behavior for a set of flow entries. Without a way to change the forwarding for multiple entries at once, the network may pass through an incorrect in-between state. An explicit level of forwarding indirection provides this atomicity without the complexity of full database semantics.
  • Provides a way to efficiently change the MP forwarding behavior for a set of flow entries. For example, two types of entries may correspond to different business priorities. These entries may not form a contiguous IP range, so there is no way to select all the entries using a non-strict flow mod. With implicit groups, changing each flow entry would require a flow mod message, yet only one bucket might have actually changed for each group of entries. Explicit groups save control channel bandwidth (since group info is not duplicated) and reduce required switch processing. From a WAN perspective, this may be reason enough to choose explicit groups, since many routes may share one nexthop (and hence the same MP group).
  • Managing the flow entry to group hash table is done on the controller, which likely has more memory and CPU.
  • Leads to a simpler, dumber switch, which seems in the spirit of OpenFlow.

Arguments for (2) implicit MP groups:

  • Possibly an easier interface for programmers, because they don't have to think about sending two messages, or any relative ordering or consistency requirements between the two messages.
  • The controller writer does not need to consider the number of hardware groups, unless the implicit groups exceed the hardware bound.

Group Types

Note that every argument above applies to both choose-one multipath groups and send-to-all multicast groups; the explicit approach would improve flexibility and add atomicity. The suggested group management functions may also make sense for other types of groups besides multipath, choose-one groups. Here's a starter list; a sketch of one possible enumeration follows it:

  • Multipath: send to one port in the group
  • Multicast: send to all ports in the group
  • Flood: send to all ports in the group, except the incoming port
  • Active/passive standby: send to one port if that port is up, otherwise use a second
  • Choose-N: send to a subset of ports in the group
  • Sampling: most of the time forward normally, but sometimes use a second bucket
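
A sketch of how these types might be enumerated in C; the names and ordering are hypothetical, not from any OpenFlow release:

  enum group_type {
      GT_MULTIPATH,   /* send to one bucket, chosen by a hash */
      GT_MULTICAST,   /* send to all buckets */
      GT_FLOOD,       /* send to all buckets except the incoming port */
      GT_STANDBY,     /* send to the first live bucket */
      GT_CHOOSE_N,    /* send to a subset of buckets */
      GT_SAMPLING     /* usually the first bucket, occasionally a second */
  };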

Ideally, the abstraction chosen by OpenFlow will be sufficiently flexible and general to support all the desired use cases.

Group Representation

There are at least three ways to represent group types:

  1. tag: group change messages include an associated type field: {all, one} or bitfield.
  2. static range offered by switch: the group type is tied to the logical port number, and these ranges vary by switch, through a config protocol or other mechanism. E.g. port numbers [100..109] are multipath and port numbers [110..119] are multicast.
  3. static range defined by protocol: partition the port space into different group types.

Static methods expose hardware constraints more naturally, but may be overly static for software switches.

Single Forwarding Numberspace

There is currently one numberspace for ports and one for queues. Should there be a unified forwarding numberspace for all forwarding types, or should groups have a separate space?

The advantage of a unified forwarding ID numberspace would be that one forward message would suffice, as well as one stats type.

Assuming one numberspace, how should the space be internally divided? Options include:

  • Static partitioning: reserve a chunk each for physical ports, special ports, and indirect ports (groups, etc). Disadvantage: past examples show static partitioning to be inefficient and inflexible.
  • Dynamic partitioning: within the space, anything goes. Disadvantage: a controller must first learn the set of ports from a switch before it can allocate any groups. If switches allocate ports in different parts of the space, the controller may need a per-switch hash table for translation, or at least the physical port bounds of each switch.

If we keep separate numberspaces for groups/indirect ports and physical ports, then we need a few more messages, but the semantics are obvious.
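
A sketch of what static partitioning of a 32-bit forwarding numberspace might look like in C; every boundary below is a hypothetical assumption:

  #include <stdint.h>

  /* Hypothetical partition boundaries. */
  #define FWD_PHYS_MIN    0x00000001u   /* physical ports */
  #define FWD_PHYS_MAX    0x0000ffffu
  #define FWD_GROUP_MIN   0x00010000u   /* groups / indirect ports */
  #define FWD_GROUP_MAX   0xfffffeffu
  #define FWD_SPECIAL_MIN 0xffffff00u   /* reserved / special ports */

  static inline int fwd_is_group(uint32_t id)
  {
      return id >= FWD_GROUP_MIN && id <= FWD_GROUP_MAX;
  }

With dynamic partitioning there are no such compile-time constants; the controller must instead query each switch for its port set before choosing group numbers.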

Single Abstraction

State for LAG and ECMP changes at different timescales. Should these be represented with a single abstraction?

Logical Port Consistency

With explicit groups, should the state of a logical port be treated as consistent? Specifically, should an error be returned when a group mod doesn't make sense, such as deleting a non-existent bucket?

Non-equal Multipath

Should non-equal-cost multipath be supported? This could be supported in a number of ways:

  • allowing repeated buckets in a group
  • associating a fraction with each bucket
  • assigning a "replication number" to each bucket, equivalent to repeated buckets but in less space

With add and delete explicit group modifications, we need to think about how this interacts with weights. For example, if I have a 2/3 1/3 split (i.e. weights are 2 and 1), can I change to a 1/2 1/2 split by sending a delete with weight 1?
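
A sketch of weighted bucket selection in C, treating each weight as a slice of the hash space; with weights {2, 1}, bucket 0 receives roughly 2/3 of flows and bucket 1 roughly 1/3. Names are illustrative:

  #include <stdint.h>

  struct weighted_bucket {
      uint32_t out_port;
      uint16_t weight;
  };

  static int pick_weighted(const struct weighted_bucket *b, unsigned n,
                           uint32_t hash)
  {
      uint32_t total = 0, point;
      unsigned i;

      for (i = 0; i < n; i++)
          total += b[i].weight;
      if (total == 0)
          return -1;                 /* empty or zero-weight group */
      point = hash % total;
      for (i = 0; i < n; i++) {
          if (point < b[i].weight)
              return (int)i;
          point -= b[i].weight;
      }
      return -1;                     /* unreachable */
  }

Under a scheme like this, a delete carrying weight 1 is ambiguous: it could name a bucket to remove or a weight to subtract, which is exactly the interaction the question above raises.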

Port Ranges

Should the switch expose the set (number and acceptable ranges) of multipath-capable logical ports through the OpenFlow protocol, or should this be out-of-band? Note that the QoS extensions from OF1.0 defined a separate namespace for queues, rather than re-using the port numberspace. Should QoS and groups be consistent in how they use number spaces?

Port Status Change Notification

If a port in a LAG-style multipath group goes down, should the controller receive a special port status change notification, one specific to logical ports?

Empty Action Bucket

If an action bucket contains no forward-port actions, what happens?

Action Buckets with Multiple Ports

If an action bucket contains two forward-port actions, what happens? Is this a legal group definition?

Group Counters

Should a group have associated counters? Should each bucket have its own counter? Per-destination counters in an ECMP group may be better supported in hardware, and may make more sense. Getting all group stats may be a prohibitively expensive operation, so a separate command to list the set of groups, without their counters, may be worthwhile.

Forwarding to Valid Ports

When a port specified in a multipath group goes down, should/must the switch forward to the remaining set of valid ports? You don't want to intentionally blackhole packets, but the switch could respond much faster to this kind of failure.

Naming

Should the phrase "port group" be used? The phrase virtual port or action container might be a better choice, as a legal "port group" could have only one member.

The phrase 'logical port' may not make sense for groups, which could motivate putting them in a separate numberspace. The service model is different: you cannot receive a packet on a group.

Software vs Hardware

If a switch cannot support a requested action combination near line rate, should it refuse the request, or accept it at reduced speed?

Multicast Forwarding out Input Port

Assuming multicast is a group type, can packets be sent out the input port?

Hash Function Selection

For any multipath group, how should the particular hashing function be defined? Some hardware supports multiple hash functions. Some options include:

  • undefined hash function types; everything is done outside the protocol
  • undefined hash function types, with in-protocol hash selection: Vendors could extend their switches to support more specific group types, such as Multipath-Hash-A and Multipath-Hash-B. Hardware may not support such flexibility, though. The hash could be selected:
    • per group
    • per interface
    • by setting a field for each group.
  • protocol-defined hash function types

The implicit choice here is a less self-consistent, but easier-to-use, protocol. Any real implementation is unlikely to specify multipath or multicast without some control over the hash function.

Port Numberspace Size

16 bits may be insufficient for all groups and ports with large switches. Expand to 32 bits?

Remove Spanning Tree?

The need for spanning-tree-specific functionality goes away with multicast groups, which provide a generic replacement for OFPP_FLOOD. There are many types of spanning tree protocols (STP, RSTP, PVST, MSTP, R-PVST, ...), and only the oldest one (STP) is considered in OpenFlow v1.0. Given the proliferation of ST protocols, the simpler solution seems to be to remove the original STP-specific functionality.

If OFPP_FLOOD is in active use, then it may make sense to leave the functionality in for OF1.1, but mark it as deprecated. Another consideration is campus deployments; campus CTO types may be wary of deploying equipment without "spanning tree support".

Fast Reroute Support

Fast Reroute (aka Local Protection or Local Restoration) is an MPLS feature which enables backup paths to be defined in advance. The first logical port defined in a fast reroute group is normally used for forwarding. When some condition associated with the first logical port (such as port connectivity) changes, the second logical port in the group takes over. The switch can independently reroute traffic to the backup path, avoiding long route reconfiguration delays. For details of FRR in MPLS, see [2].
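
A sketch of the selection logic described above, in C: buckets are ordered by preference, and the first live one wins. port_is_up() stands in for whatever liveness condition the switch uses, which the proposal leaves outside the OpenFlow spec:

  #include <stdbool.h>
  #include <stdint.h>

  struct frr_bucket {
      uint32_t fwd_id;               /* group or physical port */
  };

  bool port_is_up(uint32_t fwd_id);  /* hypothetical, provided by the switch */

  static int pick_frr(const struct frr_bucket *b, unsigned n)
  {
      for (unsigned i = 0; i < n; i++)
          if (port_is_up(b[i].fwd_id))
              return (int)i;
      return -1;   /* no live bucket: drop packets for this flow */
  }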

How should this be supported in OpenFlow? Is it a special group type, a property associated with a flow entry, or something else?

Chained Groups

Should groups be allowed to point to groups?

Advantages:

  • Provides a natural way for a controller to represent two-stage forwarding, where a first stage identifies a next hop and the second stage identifies the specific forwarding to get a packet to the next hop. For example, a BGP router may want to identify the next hop in an initial lookup, but then spread packets across multiple links to get to the next hop. If the first lookup can point to a second-stage nexthop group, then the controller writer can separate these concerns (similar to using only metadata with a multiple tables implementation - yes, there's some overlap here).
  • Provides a closer-to-atomic way for a controller to send forwarding updates. If a switch receives a change to a pointed-to group, it may be able to do the update atomically, possibly reducing a routing protocol's convergence time. In addition, the CPU and control channel bandwidth loads are reduced when fewer messages are sent.
  • Supports combinations. A fast reroute group might need to forward across multiple links in a trunk. Without chaining, we'd need to represent this case with a cross-product of all possible group combinations.


Downsides:

  • HAL implementation complexity increases. For every group mod, the switch must know which flow entries to change, and potentially every flow entry may point to the group in the incoming group mod. Without hardware support for multiple layers of forwarding, this can burden the switch CPU.
  • Possible controller/switch incompatibilities if some support chained groups while others don't.
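
To make the chaining concrete, here is a sketch in C of resolving a forwarding identifier through nested groups down to a physical port. The depth limit guards against cycles; lookup_group() and group_bucket_fwd_id() are hypothetical helpers:

  #include <stdint.h>

  #define MAX_CHAIN_DEPTH 4

  struct group;
  const struct group *lookup_group(uint32_t fwd_id);   /* NULL if not a group */
  uint32_t group_bucket_fwd_id(const struct group *g, uint32_t hash);

  /* Returns a physical port number, or -1 if the chain is too deep. */
  static int64_t resolve(uint32_t fwd_id, uint32_t hash, int depth)
  {
      const struct group *g = lookup_group(fwd_id);

      if (g == NULL)
          return (int64_t)fwd_id;        /* a physical port: done */
      if (depth >= MAX_CHAIN_DEPTH)
          return -1;                     /* refuse over-deep chains */
      /* Descend into the bucket selected for this packet's hash. */
      return resolve(group_bucket_fwd_id(g, hash), hash, depth + 1);
  }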

Proposal Philosophy

Given the stability of static LAG memberships and the extreme volatility of MP memberships based on routing application decisions, there has been some argument about the benefits of specifying separate interfaces for each. In the interest of protocol hygiene, we (the OpenFlow 1.1 working group) strongly believe that a single common abstraction should cover all use cases, and that the generic "group of action buckets" approach meets this goal.

Explicit multipath groups are better than implicit. First, they enable controller application knowledge to influence group allocations, rather than forcing the switch to discover these dynamically at flow entry install time. Second, pushing work to the controller while minimizing the runtime complexity of the switch is consistent with the OpenFlow philosophy so far of simple (dumb), minimal switches.

The suggestion of a GROUP_MOD message type comes from a strong desire to decouple configuration modification from decision processes occurring at routing time-scales, hence the decision to add this to the OpenFlow protocol, rather than defer to a separate config protocol. In practice, some switches will not support arbitrary action bucket combinations in hardware, and should reply with an error to these GROUP_MOD messages. Enumerating these capabilities would overly complicate the protocol and would probably not cover the variance of today's hardware anyway.

While we aren't particularly happy about trying to enumerate group types, what seems clear so far is that multipath, multicast, and fast reroute are the most common use cases, and having already-provided types should help users get started. Note that multipath is still underspecified, by not specifically calling out the function that chooses among buckets.

Specific Proposal

Add new OpenFlow message:

GROUP_MOD(logical_port, {set, add, delete}, type, [ [bucket1: weight1, action_list] ... [bucketN: weightN, action_listN] ])

  • Set: sets the logical port to the bucket(s) specified. If no buckets are specified, the group is cleared, and packets matching flow entries that point to the cleared group are all dropped.
  • Add: adds the specified bucket(s) to the group. Adding a duplicate bucket is not allowed, and should return an error; the weight field associated with each bucket should be used to control the distribution instead. If a switch cannot support non-equal-cost forwarding, it should return an error.
  • Delete: removes the specified bucket(s) from the group. Attempting to delete a non-existent bucket is an error.
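
A sketch of what the wire format might look like, in the style of C structs from openflow.h; field names, sizes, and padding here are illustrative assumptions, not a ratified layout:

  #include <stdint.h>

  enum group_mod_command { GM_SET, GM_ADD, GM_DELETE };

  struct ofp_bucket_hdr {
      uint16_t len;          /* bucket length incl. actions, in bytes */
      uint16_t weight;       /* defined for multipath groups only */
      /* followed by a variable-length action list */
  };

  struct ofp_group_mod {
      uint32_t logical_port; /* group identifier (32-bit port space) */
      uint16_t command;      /* one of enum group_mod_command */
      uint8_t  type;         /* multipath, multicast, fast reroute */
      uint8_t  pad;
      /* followed by zero or more bucket structures */
  };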

The type field represents the group type, for which initial values include

  • Multipath: forward to one port at most; the function to choose which port is unspecified. Hardware switches will implement different hash functions; which function to use with a particular logical port is outside the scope of OpenFlow. When a port specified in a multipath group goes down, the switch may restrict forwarding to the remaining set of valid ports. This behavior may reduce the disruption of a downed link or switch.
  • Multicast: forward out all specified ports, except the incoming port, by default. A controller writer who wants to forward out the incoming port can include the OFPP_IN_PORT action in a bucket, just as with regular forwarding.
  • Fast Reroute: forward to the first forwarding identifier (group or physical port) normally, but when some condition changes, such as a path going down, use the next forwarding identifier in the group. When no forwarding identifiers (group or port) are up, drop packets for this flow. Conditions for fast reroute groups are outside the scope of the OpenFlow spec.

The weight field represents a bucket's share of the traffic. When a port goes down, the change in traffic distribution is undefined. The weight field is only defined for multipath groups.

Buckets which consist of a single forward-port action are allowed. A group may also include buckets which themselves forward to other groups. For example, a fast reroute group may have two buckets, where each bucket forwards to a multipath group. If a switch does not support groups of groups, it should reply with an error message when it receives the unsupported group mod. In addition, a single forward-port action may forward to a multipath group. Switches should support these examples of port group chaining, but if they cannot, they should reply with an appropriate error type.

Add new codes for Group Mod error type:

GROUP_MOD_FAILED: {INVALID_LOGICAL_PORT, INVALID_ACTIONS, NON_EQUAL_MP_UNSUPPORTED, INVALID_DELETE, OUT_OF_MP_GROUPS, GROUP_OUT_OF_MP_MEMBERS}

Add new code for Flow Mod error type: FLOW_MOD_FAILED: PORT_CHAINING_UNSUPPORTED
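
A sketch of how these codes might be numbered as C enums; the values are hypothetical:

  enum group_mod_failed_code {
      GMFC_INVALID_LOGICAL_PORT,
      GMFC_INVALID_ACTIONS,
      GMFC_NON_EQUAL_MP_UNSUPPORTED,
      GMFC_INVALID_DELETE,
      GMFC_OUT_OF_MP_GROUPS,
      GMFC_GROUP_OUT_OF_MP_MEMBERS
  };

  enum flow_mod_failed_code_ext {
      FMFC_PORT_CHAINING_UNSUPPORTED
  };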

Add new OpenFlow stats type: GROUP. Similar to port stats, but returns group counters: [group, counter0, ..., counterN] tuples.

Add new OpenFlow messages to retrieve group lists: GROUPS_{REQUEST/REPLY}(optional group number) Returns each active group, along with the full set of action buckets. If an optional group number is provided, only that group is returned.

Extend the port space to 32 bits; there was strong feedback that 16 bits are insufficient. This is a small change.

The set of numbers considered valid for multipath groups is the expanded OpenFlow 32-bit port space minus 0, reserved ports, and any physical port numbers in use by the switch. The switch does not need to expose its maximum number of multipath groups; this is outside the scope of OpenFlow, and a config protocol is the expected mechanism.
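
A sketch of validating a candidate multipath group number against those rules, in C; OFPP32_RESERVED_MIN and is_physical_port() are hypothetical:

  #include <stdbool.h>
  #include <stdint.h>

  #define OFPP32_RESERVED_MIN 0xffffff00u   /* illustrative reserved range */

  bool is_physical_port(uint32_t port);     /* provided by the switch */

  static bool valid_group_number(uint32_t n)
  {
      if (n == 0)
          return false;                     /* 0 is invalid */
      if (n >= OFPP32_RESERVED_MIN)
          return false;                     /* reserved ports */
      if (is_physical_port(n))
          return false;                     /* in use by a physical port */
      return true;
  }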

Tests

The set of tests should cover all errors, and is listed here as a first cut to think through any failure cases that might be unspecified in the proposal.

  • add_none: Send group mod with no buckets, then a flow mod which forwards to the MP group, then a matching packet. Is the packet dropped, rather than a packet_in generated?
  • add_one: Send group mod with one bucket, then flow mod forwarding to the MP group, then a matching packet. Is the packet received on the correct port?
  • add_two: Send group mod with two buckets, then flow mod forwarding to the MP group, then a matching packet. Is some packet received on one of the two ports?
  • add_invalid: Attempt to send group mods to invalid ports (e.g. 0, OFPP_NONE, OFPP_ALL, etc.). Is an error sent?
  • add_twice: Send group mod with one bucket, then flow mod forwarding to the MP group, then add via group mod another bucket, then send a matching packet. Is some packet received on one of the two ports?
  • add_two_mod: Send group mod with two buckets with modify actions, then flow mod forwarding to the MP group, then send matching packet. Is the modified packet received?
  • add_nonequal: Attempt to send non-equal multipath. Is it handled correctly, or is an error returned?
  • add_nonexistent: Attempt to add a flow mod that forwards to an undefined logical multipath group. Is an error sent?
  • delete: Add a bucket to a multipath group, then delete that bucket. Check that the packet is dropped.
  • delete_invalid: Set an empty MP group, then attempt to delete some bucket. Is an error message returned?
  • set_{none/one/two}: similar to the above for add.
  • add_set: Add a bucket, then set to a different bucket. Does forwarding with the second bucket work properly? The first should have been overwritten.
  • set_clear: Add a bucket to a group, then clear that bucket. Is a packet-in generated for the corresponding flow entry when a matching packet is sent?
  • l2_random: Modify the L2 dst randomly. Do packets arrive at different ports?
  • l3_random: Modify the L3 dst randomly. Do packets arrive at different ports?

For each test, you'd also want to verify that the correct group stats are returned after the change operations.

This list is missing tests showing interactions with QoS.