In this exercise, we will implement a layer-3 forwarding switch that can load balance traffic towards a destination across equal cost paths.
We will implement ECMP (Equal-Cost Multi-Path) routing to load balance traffic across multiple ports.
When a packet with multiple candidate paths arrives, our switch should assign the next-hop by hashing some fields from the
header and compute this hash value modulo the number of possible equal paths. For example, in the topology below, when s1
has to send a packet to h2
, the switch should determine the output port by computing: hash(some-header-fields) mod 4
.
The switch does ECMP hashing on a per-flow basis to prevent out-of-order packets, which means that all packets with the same source and destination IP addresses and source and destination ports always hash to the same next hop.
To read more about ECMP, see the following page.
As usual, we provide you with the following files:
p4app.json
: a configuration file that describes the topology we want to create with the help of mininet and p4-utils package.network.py
: a Python script that initializes the topology using Mininet and P4-Utils. One can usenetwork.py
orp4app.json
indifferently to start the network.p4src/ecmp.p4
: p4 program skeleton to use as a starting point.p4src/includes
: We split our p4 code into multiple files for the first time in today's exercise. In theinclude
directory, you will findheaders.p4
andparsers.p4
(you need to complete them as well).send.py
: a small python script to generate packets with different TCP port numbers.
We will use a new IP assignment strategy for this exercise (and the next one). If you have a look at p4app.json
you will see that
the option is set to mixed
. That means only hosts connected to the same switch will be assigned to the same subnet. Hosts connected
to a different switch will belong to a different /24
subnet. If you use the namings hZ
and sXY
(e.g h1, h2, s1...), the IP assignment
goes as follows: 10.x.y.z
. Where x
and y
are the respective upper and the lower bytes of the gateway switch ID, and z
is the host id. For example, in the topology above,
h1
gets 10.0.1.1
and h2
gets 10.0.6.2
.
You can find all the documentation about p4app.json
in the p4-utils
documentation. Also, you can find information about assignment strategies here.
To solve this exercise, we have to program our switch such that it can forward L3 packets when there is one possible next hop or more.
For that, we will use two tables: in the first table, we match the destination IP, and depending on whether ECMP has to be applied (for that destination), we set the output port or an ecmp_group
.
Next, we apply a second table that maps (ecmp_group, hash_output) to egress ports.
This time you will have to fill the gaps in several files: p4src/ecmp.p4
, p4src/include/headers.p4
and p4src/include/parsers.p4
.
Additionally, you will have to create a cli
command file for each switch and name them sX-commands.txt
(see inside the p4app.json
).
We placed all the command files inside sw-commands
directory to keep the workspace more organized.
To complete the exercise, you have to do the following:
-
Use the header definitions that we already provided.
-
Define the parser that can parse packets up to
tcp
. Note that we do not considerudp
packets in this exercise for simplicity. This time you must define the parser inp4src/include/parsers.p4
. -
Define the deparser. Just emit all the headers in the correct order.
-
Define a match-action table that matches the IP destination address of every packet and has three actions:
set_nhop
,ecmp_group
,drop
. Set the drop action as default. -
Define the action
set_nhop
. This action takes two parameters: destination mac and egress port. Use the parameters to set the destination mac andegress_spec
. Set the source mac as the previous destination mac (this is not what an actual L3 switch would do, but we do it for simplicity). A more realistic implementation would create a table that maps egress_ports to each switch interface mac address. However, since the source mac address is not very important for this exercise, do this swap for now). When sending packets from a switch to another switch, the destination MAC address is not very important; thus, you can use an arbitrary one. However, keep in mind that when the packet is sent to a host needs to have the correct destination MAC address. Finally, decrease the packet's TTL by 1. Note: since we are in an L3 network when you send packets froms1
tos2
you have to use the dst mac of the switch interface, not the mac address of the receiving host, which is done in the very last hop. -
Define the action
ecmp_group
. This action takes two parameters, theecmp_group_id
(14 bits) and the number of next hops (16 bits). This action is one of the critical parts of the ECMP algorithm. You have to do several things:- In this action, we will compute a hash function. To store the output, you need to define a metadata field. Define
ecmp_hash
(14 bits) inside the metadata struct inheaders.p4
. Use thehash
extern function to compute the hash of packets 5-tuple (src ip, dst ip, src port, dst port, protocol). The signature of a hash function is:hash(output_field, (crc16 or crc32), (bit<1>)0, {fields to hash}, (bit<16>)modulo)
. You can find all the available hash functions here. For example, if you want to usecrc32
, your algorithm will beHashAlgorithm.crc32
. - Define another metadata field and call it
ecmp_group_id
(14 bits). - Finally, copy the value of the second action parameter
ecmp_group
in the metadata field you just defined (ecmp_group_id
). We use this as the key to match the second table.
- In this action, we will compute a hash function. To store the output, you need to define a metadata field. Define
Note: A lot of people asked why the ecmp_group_id
is needed. In a few words, it is a level of indirection that allows you to map from one IP address to a set of ports, which does not have to be the four ports we use in this exercise. For example, you could have that for IP(1)
, you use only the upper two ports, and for IP(2)
, you load balance using the two lower ports.
Thus, by creating two ECMP groups, you can easily map any destination address to any set of ports.
-
Define the second match-action table used to set
ecmp_group
to the next hops. The table should haveexact
matches to the metadata fields you defined in the previous step. Thus, it should match to themeta.ecmp_group_id
and then to the output of the hash functionmeta.ecmp_hash
(which will be a value ranging from 0 toNUM_NEXT_HOPS-1
). A match in this table should call theset_nhop
action you already defined above, and a miss should mark the packet to be dropped (setdrop
as default action). This indirection enables us to use any subset of interfaces. For example imagine that in the topology above, we haveh2
andh3
( h3 does not exist, but just for the sake of the example), we could define two differentecmp_group
(in the previous table), one that maps to port 2 and 4, and one that maps to port 3 and 5. And then, in this table, we could add two rules per group to make the outputs[0,1]
from the hash function match[2,4]
and[3,5]
, respectively. -
Define the ingress control logic:
- Check if the ipv4 header was parsed (use
isValid
). - Apply the first table.
- If the action
ecmp_group
was called during the first table-apply. Call the second table. Note: to know which action was called during an apply you can use a switch statement andaction_run
, to see more information about how to check which action was used, check out the P4 16 specification
- Check if the ipv4 header was parsed (use
-
In this exercise, we modify an IP header's field for the first time (remember we have to subtract 1 from the ip.ttl field). When doing so, the
ipv4
checksum field needs to be updated; otherwise, other network devices (or receiving hosts) might drop the packet. For that purpose, thev1model
provides anextern
function that can be called inside theMyComputeChecksum
control to update checksum fields. In this exercise, you do not have to do anything. Instead, go to theecmp.p4
file and check how the code usesupdate_checksum
. -
This time you have to write six
sX-commands.txt
files, one per switch. Note that onlys1
ands6
need to haveecmp_group
installed. For all the other switches setting rules for the first table (using actionset_nhop
) will suffice. Fors1
you have to set a direct next hop towardsh1
, and an ecmp group towardsh2
. Set the ecmp group withid = 1
andnum_hops = 4
. Then define 4 rules that map from 0 to 3 to one of the 4 switch output ports using the second table.
Once you have the ecmp.p4
program finished (and all the commands.txt
files), you can test its behavior:
-
Start the topology (this will also compile and load the program).
sudo p4run
or
sudo python network.py
-
Check that you can ping:
mininet> pingall
-
Monitor the 4 links from
s1
that will be used duringecmp
(froms1-eth2
tos1-eth5
). By doing this, you will be able to check which path each flow takes.sudo tcpdump -enn -i s1-ethX
-
Ping between two hosts:
You should see traffic in only 1 or 2 interfaces (due to the return path) since all the ping packets of the same flow have the same 5-tuple.
-
Do iperf between two hosts:
You should also see traffic in 1 or 2 interfaces (due to the return path). Since a flow's packets have the same 5-tuple, and thus the hash always returns the same index.
-
Get a terminal in
h1
. Use thesend.py
script.python send.py 10.0.6.2 1000
This will send
tcp syn
packets with random ports. Now you should see packets going to all the interfaces since each packet will have a different hash.
We have added a small guideline in the documentation section. Use it as a reference when things do not work as expected.