Configuring NVIDIA BlueField2 SmartNIC

Jan 6, 2022

SmartNIC is a new emerging hardware where a NIC with general-purpose CPU cores. NVIDIA BlueField2 equips 8 ARM Cortex A-72 cores, which can be used to process offloaded functions. This functions are not limited to packet processing, but can also be more complicated applications, e.g., file system, etc.

This post talks about how to configure NVIDIA BlueField2 SmartNIC, on CloudLab r7525 machines.

Initial Setup

First, install Mellanox OFED device drivers from here. Current latest version is 5.5-1.0.3.2 and LTS version is 4.9-4.1.7.0, but I used 5.4-3.1.0.0 version.

Once you untar the archive, run mlnxofedinstall to install:

$ sudo ./mlnxofedinstall --without-fw-update
...
$ sudo /etc/init.d/openibd restart
Unloading HCA drvier:                [ OK ]
Loading HCA driver and Access Layer: [ OK ]

Now, you will see a new network interface tmfifo_net0, managed by rshim host driver. You can access BlueField via this interface.

NVIDIA and Mellanox explains BlueField can be accessed via rshim USB and rshim PCIe, which are managed by kernel modules named rshim_usb.ko and rshim_pcie.ko, respectively, according to the Section 2.7 of the following manual. But it seems to be changed, and there is no kernel modules installed named rshim* and just a process /usr/sbin/rhsim is used, managed by systemd.

$ lsmod | grep -i rshim
(nothing)
$ modinfo rshim
modinfo: ERROR: Module rshim not found.
$ systemctl status rshim
● rshim.service - rshim driver for BlueField SoC
     Loaded: loaded (/lib/systemd/system/rshim.service; enabled; vendor preset: enabled)
    Active: active (running) since Thu 2022-01-06 10:31:28 EST; 11h ago
      Docs: man:rshim(8)
  Main PID: 465131 (rshim)
     Tasks: 6 (limit: 618624)
    Memory: 2.8M
    CGroup: /system.slice/rshim.service
            └─465131 /usr/sbin/rshim
...

the source of which is open in here.

The other endpoint of tmfifo_net0 is in BlueField, and its IPv4 address is 192.168.100.2/24, hence you can use ssh to access BlueField via this address. First, we need to assign an address.

$ ip addr add 192.168.100.1/24 dev tmfifo_net0
$ ping 192.168.100.2 # check whether it is properly configured
PING 192.168.100.2 (192.168.100.2) 56(84) bytes of data.
64 bytes from 192.168.100.2: icmp_seq=1 ttl=64 time=0.821 ms
...
$ ssh ubuntu@192.168.100.2
ubuntu@192.168.100.2's password: ubuntu

By default username and password are both ubuntu.

You are now in Linux running on BlueField2 ARM CPU.

$ lscpu
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              8
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           0
Model name:                      Cortex-A72
...

Configuring Embedded CPU Function Ownership (ECPF) Mode

NVIDIA BlueField2 SmartNIC has three modes of operation as of now 1:

  1. Separated host mode (symmetric model)
  2. Embedded function (ECPF)
  3. Restricted mode

To check which mode a SmartNIC is running on, use the following command in the host:

$ mst start
$ mst status -v # identify the MST device
$ mlxconfig -d /dev/mst/mtXXXXX_pciconf0 q | grep -i internal_cpu_model
INTERNAL_CPU_MODEL                  EMBEDDED_CPU(1)

If you see SEPARATED_HOST(0), you can change it by:

$ mlxconfig -d /dev/mst/mtXXXXX_pciconf0 s INTERNAL_CPU_MODEL=1
$ mlxconfig -d /dev/mst/mtXXXXX_pciconf0.1 s INTERNAL_CPU_MODEL=1
$ # restart the server
image

BlueField2 SmartNIC modes (Separated and ECPF mode). [Source]

In separated host mode, the host CPU and ARM CPU are like separate nodes, but sharing the same NIC. In ECPF mode, however, all communications between the host server and the outer world go through the SmartNIC ARM. For more detailed information, refer to this post2.

NVIDIA document says ECPF mode is a default (Two years ago Mellanox said separated mode was a default mode. Not sure which one is actual), but anyway I will use ECPF mode.

When in ECPF ownership mode, BlueField uses netdev representors to map each one of the host side physical and virtual functions 3. You can see a lot of network interfaces in BlueField:

$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: tmfifo_net0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 00:1a:ca:ff:ff:01 brd ff:ff:ff:ff:ff:ff
3: oob_net0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 0c:42:a1:a4:89:ea brd ff:ff:ff:ff:ff:ff
4: p0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether 0c:42:a1:a4:89:e4 brd ff:ff:ff:ff:ff:ff
5: p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master ovs-system state DOWN mode DEFAULT group default qlen 1000
    link/ether 0c:42:a1:a4:89:e5 brd ff:ff:ff:ff:ff:ff
6: pf0hpf: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether d2:d4:df:81:68:1f brd ff:ff:ff:ff:ff:ff
7: pf1hpf: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether c6:ee:a2:c6:fa:e2 brd ff:ff:ff:ff:ff:ff
8: en3f0pf0sf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether fe:18:2b:c7:14:b8 brd ff:ff:ff:ff:ff:ff
9: enp3s0f0s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 02:9e:0f:d4:58:dc brd ff:ff:ff:ff:ff:ff
10: en3f1pf1sf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether fe:1f:d8:0d:c7:f1 brd ff:ff:ff:ff:ff:ff
11: enp3s0f1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 02:fb:ba:ee:89:92 brd ff:ff:ff:ff:ff:ff
12: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ca:e4:08:3d:0f:fc brd ff:ff:ff:ff:ff:ff
13: ovsbr1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 0c:42:a1:a4:89:e4 brd ff:ff:ff:ff:ff:ff
14: ovsbr2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000

The naming convention for the representors is as follows, according to the DOCA documentation 3:

  • Uplink representors: p<portnum>
  • PF rerpesentors: pf<portnum>hpf
  • VF representors: pf<portnum>vf<funcnum>
image

BlueField2 kernel representors model. The red arrow demonstrates a packet flow through the representors, while the green arrow demonstrates the packet flow wheen steering rules are offloaded to the embedded switch (E-switch). [Source]

As mentioned before, all communications from the host go through SmartNIC ARM cores, and the virtual switch in Open vSwitch running on the ARM cores allows it.

$ systemctl status openvswitch-switch.service
● openvswitch-switch.service - LSB: Open vSwitch switch
     Loaded: loaded (/etc/init.d/openvswitch-switch; generated)
     Active: active (exited) since Wed 2021-07-21 19:00:40 UTC; 13h ago
       Docs: man:systemd-sysv-generator(8)
    Process: 2374 ExecStart=/etc/init.d/openvswitch-switch start (code=exited, status=0/SUCCESS)
...
$ ovs-vsctl show
    Bridge ovsbr1
        Port pf0hpf
            Interface pf0hpf
        Port ovsbr1
            Interface ovsbr1
                type: internal
        Port en3f0pf0sf0
            Interface en3f0pf0sf0
        Port p0
            Interface p0
    Bridge ovsbr2
        Port p1
            Interface p1
        Port en3f1pf1sf0
            Interface en3f1pf1sf0
        Port ovsbr2
            Interface ovsbr2
                type: internal
        Port pf1hpf
            Interface pf1hpf
    ovs_version: "2.14.1"

NVIDIA BlueField2 supports OVS offloading using NVIDIA ASAP2 (Accelerated Switching and Packet Processing) technology. Refer to this for more information.

Note that there is no VF representors in my environment; instead, it has SF representors. First I thought it is a subfunction, but it is a unique name called scalable function that NVIDIA uses only for BlueField-2 DPUs 4.

image

Scalable Functions (SFs) co-exist with PCIe SR-IOV virtual functions, and they do not need enabling PCIe SR-IOV. Each SF has its own dedicated queues which are neither shared nor stolen from the parent PCIe function. [Source]

Configuring SmartNIC Network Connectivity

As of now, Linux on SmartNIC cannot access to another SmartNIC on another machine, nor even to the host except through rshim layer.

Host Connection

As I don’t use a DHCP server, I should manually assign IPv4 addresses to SF, according to the manual.

# on the SmartNIC Linux
$ ovs-vsctl show
    Bridge ovsbr1
        Port en3f0pf0sf0
            Interface en3f0pf0sf0
    ...
$ ip link
8: en3f0pf0sf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
    link/ether fe:18:2b:c7:14:b8 brd ff:ff:ff:ff:ff:ff
9: enp3s0f0s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 02:9e:0f:d4:58:dc brd ff:ff:ff:ff:ff:ff
...
$ ip addr add 192.168.10.2/24 dev enp3s0f0s0

# on the host Linux
$ ip link
...
15: ens5f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:42:a1:a4:89:e0 brd ff:ff:ff:ff:ff:ff
$ ip addr add 192.168.10.1/24 dev ens5f0

Refer to this answer to find out why there are two similar network interfaces:

There are two representors for each one of the DPU’s network ports: one for the uplink, and another one for the host side PF (the PF representor created even if the PF is not probed on the host side).

If you check mlnx-sf -a show on the DPU you’ll see the kernel netdev and the representor names for the DPU itself. The Representor netdev devices are already connected (added to) the OVS bridge by default, so we only need to use netplan to assign an IP address to enp3s0f1s0 or enp3s0f0s0 (or both) and that should work.

Note that the following is my result of mlnx-sf -a show run on SmartNIC Linux:

$ mlnx-sf -a show

SF Index: pci/0000:03:00.0/229408
  Parent PCI dev: 0000:03:00.0
  Representor netdev: en3f0pf0sf0
  Function HWADDR: 02:9e:0f:d4:58:dc
  Auxiliary device: mlx5_core.sf.2
    netdev: enp3s0f0s0
    RDMA dev: mlx5_2

SF Index: pci/0000:03:00.1/294944
  Parent PCI dev: 0000:03:00.1
  Representor netdev: en3f1pf1sf0
  Function HWADDR: 02:fb:ba:ee:89:92
  Auxiliary device: mlx5_core.sf.3
    netdev: enp3s0f1s0
    RDMA dev: mlx5_3

That’s why I put an IP address to the device enp3s0f0s0, not en3f0pf0sf0.

Now it is possible to ping each other.

# on the SmartNIC Linux
$ ping 192.168.10.1 -c 2
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=51.8 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=0.114 ms

# on the host Linux
$ ping 192.168.10.2 -c 2
PING 192.168.10.2 (192.168.10.2) 56(84) bytes of data.
64 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=33.7 ms
64 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=0.143 ms

Connecting Multiple Nodes

If host PFs are in the same subnet, they are reachable as well. For example,

  1. Node 1 host: ens5f0: 192.168.10.1/24
  2. Node 1 SmartNIC: enp3s0f0s0: 192.168.10.2/24
  3. Node 2 host: ens5f0: 192.168.10.3/24
  4. Node 2 SmartNIC: enp3s0f0s0: 192.168.10.4/24

Ping from 1 to 2 (host to SmartNIC in the same machine):

$ ping 192.168.10.2 -c 2
PING 192.168.10.2 (192.168.10.2) 56(84) bytes of data.
64 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=15.5 ms
64 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=0.133 ms

Ping from 1 to 3 (host to another host):

$ ping 192.168.10.3 -c 2
PING 192.168.10.3 (192.168.10.3) 56(84) bytes of data.
64 bytes from 192.168.10.3: icmp_seq=1 ttl=64 time=94.4 ms
64 bytes from 192.168.10.3: icmp_seq=2 ttl=64 time=0.078 ms

Ping from 1 to 4 (host to SmartNIC in another machine):

$ ping 192.168.10.4 -c 2
PING 192.168.10.4 (192.168.10.4) 56(84) bytes of data.
64 bytes from 192.168.10.4: icmp_seq=1 ttl=64 time=0.403 ms
64 bytes from 192.168.10.4: icmp_seq=2 ttl=64 time=25.6 ms

Ping from 2 to 3 (SmartNIC to another host):

$ ping 192.168.10.3 -c 2
PING 192.168.10.3 (192.168.10.3) 56(84) bytes of data.
64 bytes from 192.168.10.3: icmp_seq=1 ttl=64 time=97.2 ms
64 bytes from 192.168.10.3: icmp_seq=2 ttl=64 time=0.087 ms

Ping from 2 to 4 (SmartNIC to another SmartNIC):

$ ping 192.168.10.4 -c 2
PING 192.168.10.4 (192.168.10.4) 56(84) bytes of data.
64 bytes from 192.168.10.4: icmp_seq=1 ttl=64 time=0.217 ms
64 bytes from 192.168.10.4: icmp_seq=2 ttl=64 time=0.187 ms

Enable NAT for Network in SmartNIC Linux

In CloudLab configuration, only eno1 interface in the host is connected to the Internet. Hence, SmartNIC Linux needs to go through the host Linux to access the Internet. This post well explains how to setup NAT between them 5, so I just summarized the way here.

# On the host
$ echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
$ iptables -t nat -A POSTROUTING -o eno1 -j MASQUERADE

# In case IP forwarding is not working
$ iptables -A FORWARD -o eno1 -j ACCEPT
$ iptables -A FORWARD -m state --state ESTABLISHED,RELATED -i eno1 - j ACCEPT

# On the SmartNIC
$ echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# On the SmartNIC
$ ping google.com -c 2
PING google.com (74.125.21.100) 56(84) bytes of data.
64 bytes from yv-in-f100.1e100.net (74.125.21.100): icmp_seq=1 ttl=108 time=7.19 ms
64 bytes from yv-in-f100.1e100.net (74.125.21.100): icmp_seq=2 ttl=108 time=6.93 ms

  1. NVIDIA BlueField DPU Family Software Documentation: DPU Operation ↩︎

  2. NVIDIA Mellanox Bluefield-2 SmartNIC Hands-on Tutorial Part 2: Change Mode of Operation and Install DPDK ↩︎

  3. NVIDIA DOCA vSwitch and Representors Model ↩︎

  4. NVIDIA BlueField DPU OS Platform Documentation: Scalable Functions ↩︎

  5. NVIDIA Mellanox Bluefield-2 SmartNIC Hands-on Tutorial Part 1: Install Drivers and Access the SmartNIC ↩︎

comments powered by Disqus