How exactly does xmit_hash_policy layer3+4 hash in Linux bonding mode 4?

2020-01-10

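While using iperf3 to test an LACP link aggregation built from four NICs, with xmit_hash_policy set to layer3+4, I could never saturate the bandwidth when iperf3 ran with a small number of streams.
For example, with the source IP, destination IP, and destination port held fixed, four streams using the four consecutive source ports 10001 to 10004 (iperf3 options -P 4 -B src_ip --cport 10001) only drove TCP traffic through three NICs. Both the counters in /proc/net/dev and a tcpdump capture showed outbound traffic on NICs 1, 2, and 3 only.
Port 10001 used NIC 1, ports 10002 and 10003 used NIC 2, and port 10004 used NIC 3; NIC 4 sent nothing.
Later tests with other runs of consecutive ports showed the same pattern, and sometimes only two NICs carried traffic.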
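To make the per-NIC check above repeatable, here is a minimal sketch that pulls the transmitted-bytes counter for each interface out of /proc/net/dev. The function names are my own, and it assumes the usual layout of that file (two header lines, then eight receive columns followed by the transmit columns), so treat it as an illustration rather than a robust tool.

```python
def tx_bytes_per_interface(proc_net_dev_text):
    """Return {interface: transmitted bytes} parsed from /proc/net/dev text."""
    counters = {}
    # The first two lines are column headers; the rest are "iface: <counters>".
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, fields = line.split(":", 1)
        values = fields.split()
        # Columns 0-7 are receive counters; column 8 is transmitted bytes.
        counters[name.strip()] = int(values[8])
    return counters


def tx_delta(before, after):
    """Bytes sent per interface between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


def read_snapshot(path="/proc/net/dev"):
    """Take a snapshot of the live counters (Linux only)."""
    with open(path) as f:
        return tx_bytes_per_interface(f.read())
```

Taking one `read_snapshot()` before the iperf3 run and one after, then passing both to `tx_delta`, shows which slave interfaces actually carried the outbound traffic.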

So I looked up the kernel bonding documentation, which describes the hash as follows:

layer3+4

   This policy uses upper layer protocol information,
   when available, to generate the hash.  This allows for
   traffic to a particular network peer to span multiple
   slaves, although a single connection will not span
   multiple slaves.

   The formula for unfragmented TCP and UDP packets is

   hash = source port, destination port (as in the header)
   hash = hash XOR source IP XOR destination IP
   hash = hash XOR (hash RSHIFT 16)
   hash = hash XOR (hash RSHIFT 8)
   And then hash is reduced modulo slave count.

   If the protocol is IPv6 then the source and destination
   addresses are first hashed using ipv6_addr_hash.

   For fragmented TCP or UDP packets and all other IPv4 and
   IPv6 protocol traffic, the source and destination port
   information is omitted.  For non-IP traffic, the
   formula is the same as for the layer2 transmit hash
   policy.

   This algorithm is not fully 802.3ad compliant.  A
   single TCP or UDP conversation containing both
   fragmented and unfragmented packets will see packets
   striped across two interfaces.  This may result in out
   of order delivery.  Most traffic types will not meet
   this criteria, as TCP rarely fragments traffic, and
   most UDP traffic is not involved in extended
   conversations.  Other implementations of 802.3ad may
   or may not tolerate this noncompliance.

But according to this algorithm, the four ports 10001 to 10004 should be spread across all four NICs.

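def xmit_hash(slaves, sport, dport, sip, dip):
    # Documented formula; sip and dip are IPv4 addresses as 32-bit integers.
    hash = (sport << 16) + dport   # source port, destination port
    hash = hash ^ sip ^ dip        # mix in both addresses
    hash = hash ^ (hash >> 16)     # fold the upper half into the lower half
    hash = hash ^ (hash >> 8)
    return (hash % slaves)         # index of the slave that transmits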

That did not explain it, so I dug through the source in drivers/net/bonding/bond_main.c and found that the code actually differs from the documentation:

hash ^= (__force u32)flow_get_u32_dst(&flow) ^
	(__force u32)flow_get_u32_src(&flow);
hash ^= (hash >> 16);
hash ^= (hash >> 8);

return hash >> 1;

Translated to Python, this is roughly:

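def xmit_hash(slaves, sport, dport, sip, dip):
    # Same as the documented formula, except for the final right shift.
    hash = (sport << 16) + dport
    hash = hash ^ sip ^ dip
    hash = hash ^ (hash >> 16)
    hash = hash ^ (hash >> 8)
    return ((hash >> 1) % slaves)  # the lowest hash bit is discarded first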

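This hash >> 1 is the key detail. With four bonded NICs and the source and destination addresses and destination port held fixed, as in the example above, source ports (2n, 2n+1) land on the same NIC: 10001 gets one NIC, 10002 and 10003 share one, and 10004 gets another. Three NICs are used and one sits idle, which matches the test results.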
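The difference is easy to check numerically. The sketch below follows the Python translation above and compares the documented formula with the shifted one for source ports 10001 to 10004. The addresses and destination port are made-up placeholders, and byte order and the kernel's flow dissector are ignored, so the absolute slave numbers need not match a real system; only the grouping of ports is meaningful.

```python
import ipaddress

SLAVES = 4
# Hypothetical endpoints; any fixed values give the same grouping.
SIP, DIP, DPORT = "192.0.2.10", "192.0.2.20", 5201


def layer34_hash(sport, dport, sip, dip, discard_low_bit):
    """Simplified IPv4 layer3+4 hash, optionally dropping the lowest bit."""
    h = (sport << 16) + dport
    h ^= int(ipaddress.IPv4Address(sip)) ^ int(ipaddress.IPv4Address(dip))
    h ^= h >> 16
    h ^= h >> 8
    return h >> 1 if discard_low_bit else h


def slave_map(ports, discard_low_bit):
    """Map each source port to the slave index it hashes to."""
    return {
        p: layer34_hash(p, DPORT, SIP, DIP, discard_low_bit) % SLAVES
        for p in ports
    }


ports = range(10001, 10005)
documented = slave_map(ports, discard_low_bit=False)
actual = slave_map(ports, discard_low_bit=True)

print("documented:", documented, "slaves used:", len(set(documented.values())))
print("actual:    ", actual, "slaves used:", len(set(actual.values())))
```

The documented formula spreads the four ports over four slaves, while the shifted version puts 10002 and 10003 on the same slave and uses only three.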

So why was it done this way? Fortunately, git log has the answer:

# git log -p b5f862180d7011d9575d0499fa37f0f25b423b12
Author: Hangbin Liu <liuhangbin@gmail.com>
Date:   Mon Nov 6 09:01:57 2017 +0800

    bonding: discard lowest hash bit for 802.3ad layer3+4
    
    After commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
    in connect()"), we will try to use even ports for connect(). Then if an
    application (seen clearly with iperf) opens multiple streams to the same
    destination IP and port, each stream will be given an even source port.
    
    So the bonding driver's simple xmit_hash_policy based on layer3+4 addressing
    will always hash all these streams to the same interface. And the total
    throughput will limited to a single slave.
    
    Change the tcp code will impact the whole tcp behavior, only for bonding
    usage. Paolo Abeni suggested fix this by changing the bonding code only,
    which should be more reasonable, and less impact.
    
    Fix this by discarding the lowest hash bit because it contains little entropy.
    After the fix we can re-balance between slaves.
    
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c99dc59d729b..76e8054bfc4e 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3253,7 +3253,7 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
        hash ^= (hash >> 16);
        hash ^= (hash >> 8);
 
-       return hash;
+       return hash >> 1;
 }

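Judging by the name, the author appears to be Chinese as well. The gist: another commit from 2015, 07f4c90062f8, changed how ephemeral source ports are chosen, so with tools such as iperf3 the automatically assigned source port now advances by 2 each time rather than by 1. With the original xmit_hash_policy this causes a problem: with four NICs, for example, traffic can only ever hash to two of them. So he discards the lowest bit of the hash to cope with the change.
The commit landed in 2017 and has not been added to the 3.10.0-514 kernel shipped with CentOS/RHEL 7.3.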
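The effect described in the commit message can be reproduced with a small sketch. It assumes eight streams whose source ports are all even (the port numbers and destination port are arbitrary placeholders) and treats the XOR of the two addresses as a constant, set to 0 here, since the addresses do not change between streams.

```python
def fold(sport, dport=5201, addr_xor=0):
    """Simplified layer3+4 hash before the final shift/modulo step."""
    h = ((sport << 16) + dport) ^ addr_xor
    h ^= h >> 16
    h ^= h >> 8
    return h


def slaves_used(ports, n_slaves, discard_low_bit):
    """Number of distinct slaves that the given source ports hash to."""
    shift = 1 if discard_low_bit else 0
    return len({(fold(p) >> shift) % n_slaves for p in ports})


# Eight hypothetical streams whose source ports step by 2.
even_ports = range(40000, 40016, 2)

for n in (2, 4):
    old = slaves_used(even_ports, n, discard_low_bit=False)
    new = slaves_used(even_ports, n, discard_low_bit=True)
    print(f"{n} slaves: before the commit {old} used, after {new} used")
```

With even-only ports the lowest hash bit never changes, so the old code reaches one slave out of two, or two out of four. Dropping that bit restores the full spread.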

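I had run into source ports stepping by 2 before, while testing with Python, but never looked into it. Now I have the answer to that as well, which is a nice coincidence.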