RDMA with RoCE

Introduction

This chapter introduces how Pods access the network through the RoCE interfaces of the host.

Features

The RDMA subsystem on a host can run in one of two network namespace modes: shared or exclusive. Containers can therefore either share an RDMA NIC or access one exclusively. In Kubernetes, shared NICs are exposed with the macvlan or ipvlan CNI, while exclusive NICs are exposed with the SR-IOV CNI.

  • Shared mode. Spiderpool leverages macvlan or ipvlan CNI to expose RoCE network cards on the host machine for all Pods. The RDMA shared device plugin is employed for exposing RDMA card resources and scheduling Pods.

  • Exclusive mode. Spiderpool utilizes SR-IOV CNI to expose RDMA cards on the host machine for Pods, providing access to RDMA resources. RDMA CNI is used to ensure isolation of RDMA devices.

    For isolated RDMA network cards, at least one of the following conditions must be met:

    (1) A kernel based on 5.3.0 or newer, with the RDMA modules loaded in the system. The rdma-core package provides the means to automatically load the relevant modules on system start.

    (2) Mellanox OFED version 4.7 or newer. In this case it is not required to use a kernel based on 5.3.0 or newer.
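
    As a quick sanity check, you can verify either condition as follows (a sketch; it assumes the Mellanox OFED package provides the ofed_info tool):

    # kernel version; 5.3.0 or newer satisfies condition (1)
    ~# uname -r

    # OFED version; 4.7 or newer satisfies condition (2)
    ~# ofed_info -s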

Shared RoCE NIC with macvlan or ipvlan

The following steps demonstrate how to enable shared usage of RDMA devices by Pods in a cluster with two nodes via macvlan CNI:

  1. Ensure that the host machine has an RDMA NIC installed and that its driver is properly set up so that RDMA works correctly.

    In our demo environment, the host machine is equipped with a Mellanox ConnectX-5 NIC with RoCE capabilities. Follow the official NVIDIA guide to install the latest OFED driver. To confirm the presence of RoCE devices, use the following commands:

    ~# rdma link
    link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens6f0np0
    link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev ens6f1np1
    
    ~# ibstat mlx5_0 | grep "Link layer"
    Link layer: Ethernet
    

    Make sure that the RDMA subsystem of the host is in shared mode. If not, switch to shared mode.

    ~# rdma system
    netns shared copy-on-fork on

    # switch to shared mode
    ~# rdma system set netns shared
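
    To keep shared mode across host reboots, you can persist it through the ib_core module option (a sketch mirroring the exclusive-mode setting shown later in this document; it assumes netns_mode=1 corresponds to shared mode):

    ~# echo "options ib_core netns_mode=1" >> /etc/modprobe.d/ib_core.conf
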
  2. Look up the details of the RDMA NIC for subsequent device resource discovery by the device plugin.

    Enter the following command; in this example the NIC vendor is 15b3 and the device ID is 1017:

    ~# lspci -nn | grep Ethernet
    af:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    af:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    
  3. Install Spiderpool and configure the RDMA shared device plugin:

    helm upgrade --install spiderpool spiderpool/spiderpool --namespace kube-system  --reuse-values \
       --set rdma.rdmaSharedDevicePlugin.install=true \
       --set rdma.rdmaSharedDevicePlugin.deviceConfig.resourcePrefix="spidernet.io" \
       --set rdma.rdmaSharedDevicePlugin.deviceConfig.resourceName="hca_shared_devices" \
       --set rdma.rdmaSharedDevicePlugin.deviceConfig.rdmaHcaMax=500 \
       --set rdma.rdmaSharedDevicePlugin.deviceConfig.vendors="15b3" \
       --set rdma.rdmaSharedDevicePlugin.deviceConfig.deviceIDs="1017"
    
    • If the macvlan CNI is not installed in your cluster, you can specify the Helm parameter --set plugins.installCNI=true to install it.

    • If you are a user from China, you can specify the parameter --set global.imageRegistryOverride=ghcr.m.daocloud.io to avoid image pull failures from Spiderpool.

    After completing the installation of Spiderpool, you can manually edit the spiderpool-rdma-shared-device-plugin configmap to reconfigure the RDMA shared device plugin.
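
    For example, you can edit the ConfigMap and then restart the plugin Pods so the new configuration is picked up (a sketch; it assumes the plugin runs as a DaemonSet with the same name as the component shown below):

    ~# kubectl edit configmap spiderpool-rdma-shared-device-plugin -n kube-system
    ~# kubectl rollout restart daemonset spiderpool-rdma-shared-device-plugin -n kube-system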

    Once the installation is complete, the following components will be installed:

    ~# kubectl get pod -n kube-system
    spiderpool-agent-9sllh                         1/1     Running     0          1m
    spiderpool-agent-h92bv                         1/1     Running     0          1m
    spiderpool-controller-7df784cdb7-bsfwv         1/1     Running     0          1m
    spiderpool-init                                0/1     Completed   0          1m
    spiderpool-rdma-shared-device-plugin-dr7w8     1/1     Running     0          1m
    spiderpool-rdma-shared-device-plugin-zj65g     1/1     Running     0          1m
    
  4. View the available resources on a node, including the reported RDMA device resources:

    ~# kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]'
      [
        {
          "name": "10-20-1-10",
          "allocable": {
            "cpu": "40",
            "memory": "263518036Ki",
            "pods": "110",
            "spidernet.io/hca_shared_devices": "500",
            ...
          }
        },
        ...
      ]
    

    If the reported resource count is 0, it may be due to the following reasons:

    (1) Verify that the vendors and deviceID in the spiderpool-rdma-shared-device-plugin configmap match the actual values.

    (2) Check the logs of the rdma-shared-device-plugin. If you encounter errors like the one below about a missing RDMA device spec, try installing rdma-core on the host machine (apt-get install rdma-core or dnf install rdma-core).

    error creating new device: "missing RDMA device spec for device 0000:04:00.0, RDMA device \"issm\" not found"
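
    To inspect the plugin logs, you can use one of the Pod names shown earlier (replace it with the Pod running on the affected node):

    ~# kubectl logs -n kube-system spiderpool-rdma-shared-device-plugin-dr7w8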

  5. Create a macvlan CNI configuration, specifying spec.macvlan.master as the RDMA interface of the node, and set up the corresponding IP pool resources:

    cat <<EOF | kubectl apply -f -
    apiVersion: spiderpool.spidernet.io/v2beta1
    kind: SpiderIPPool
    metadata:
      name: v4-81
    spec:
      gateway: 172.81.0.1
      ips:
      - 172.81.0.100-172.81.0.120
      subnet: 172.81.0.0/16
    ---
    apiVersion: spiderpool.spidernet.io/v2beta1
    kind: SpiderMultusConfig
    metadata:
      name: macvlan-ens6f0np0
      namespace: kube-system
    spec:
      cniType: macvlan
      macvlan:
        master:
        - "ens6f0np0"
        ippools:
          ipv4: ["v4-81"]
    EOF
    
  6. Using the configuration from the previous step, create a DaemonSet application that spans the nodes for testing:

    ANNOTATION_MULTUS="v1.multus-cni.io/default-network: kube-system/macvlan-ens6f0np0"
    RESOURCE="spidernet.io/hca_shared_devices"
    NAME=rdma-macvlan
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: ${NAME}
      labels:
        app: $NAME
    spec:
      selector:
        matchLabels:
          app: $NAME
      template:
        metadata:
          name: $NAME
          labels:
            app: $NAME
          annotations:
            ${ANNOTATION_MULTUS}
        spec:
          containers:
          - image: docker.io/mellanox/rping-test
            imagePullPolicy: IfNotPresent
            name: mofed-test
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                ${RESOURCE}: 1
            command:
            - sh
            - -c
            - |
              ls -l /dev/infiniband /sys/class/net
              sleep 1000000
    EOF
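
    Once the DaemonSet is up, you can list its Pods and open a shell in one of them for the verification in the next step (replace the Pod name with one from your cluster):

    ~# kubectl get pod -o wide -l app=rdma-macvlan
    ~# kubectl exec -it <pod-name> -- sh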
    
  7. Verify that RDMA communication works correctly between Pods across nodes.

    Open a terminal and access one Pod to launch a service:

    # You are able to see all the RDMA cards on the host machine
    ~# rdma link
    0/1: mlx5_0/1: state ACTIVE physical_state LINK_UP
    1/1: mlx5_1/1: state ACTIVE physical_state LINK_UP
    
    # Start an RDMA service
    ~# ib_read_lat
    

    Open a terminal and access another Pod to launch a service:

    # You are able to see all the RDMA cards on the host machine
    ~# rdma link
    0/1: mlx5_0/1: state ACTIVE physical_state LINK_UP
    1/1: mlx5_1/1: state ACTIVE physical_state LINK_UP
    
    # Access the service running in the other Pod
    ~# ib_read_lat 172.81.0.120
    ---------------------------------------------------------------------------------------
                        RDMA_Read Latency Test
     Dual-port       : OFF    Device         : mlx5_0
     Number of qps   : 1    Transport type : IB
     Connection type : RC   Using SRQ      : OFF
     TX depth        : 1
     Mtu             : 1024[B]
     Link type       : Ethernet
     GID index       : 12
     Outstand reads  : 16
     rdma_cm QPs   : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0000 QPN 0x0107 PSN 0x79dd10 OUT 0x10 RKey 0x1fddbc VAddr 0x000000023bd000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:119
     remote address: LID 0000 QPN 0x0107 PSN 0x40001a OUT 0x10 RKey 0x1fddbc VAddr 0x00000000bf9000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:120
    ---------------------------------------------------------------------------------------
     #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
    Conflicting CPU frequency values detected: 2200.000000 != 1040.353000. CPU Frequency is not max.
    Conflicting CPU frequency values detected: 2200.000000 != 1849.351000. CPU Frequency is not max.
     2       1000          6.88           16.81        7.04              7.06         0.31      7.38        16.81
    ---------------------------------------------------------------------------------------
    

Isolated RoCE NIC with SR-IOV

The following steps demonstrate how to enable isolated usage of RDMA devices by Pods in a cluster with two nodes via SR-IOV CNI:

  1. Ensure that the host machine has an SR-IOV-capable RDMA NIC and that its driver is properly installed.

    In our demo environment, the host machine is equipped with a Mellanox ConnectX-5 NIC with RoCE capabilities. Follow the official NVIDIA guide to install the latest OFED driver.

    To confirm the presence of RoCE devices, use the following command:

    ~# rdma link
    link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens6f0np0
    link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev ens6f1np1
    
    ~# ibstat mlx5_0 | grep "Link layer"
    Link layer: Ethernet
    

    Make sure that the RDMA subsystem on the host is operating in exclusive mode. If not, switch to exclusive mode.

    # switch to exclusive mode (the setting is lost after a host reboot)
    ~# rdma system set netns exclusive
    # make the setting persistent across reboots:
    ~# echo "options ib_core netns_mode=0" >> /etc/modprobe.d/ib_core.conf
    
    ~# rdma system
    netns exclusive copy-on-fork on
    

    (Optional) In an SR-IOV scenario, applications can enable NVIDIA's GPUDirect RDMA feature. For instructions on installing the kernel module, please refer to the official documentation.
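
    As a quick check, you can verify that the GPUDirect RDMA kernel module is loaded (assuming it is provided as nvidia_peermem, as in recent NVIDIA driver releases):

    ~# lsmod | grep nvidia_peermem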

  2. Install Spiderpool

    • Specify the Helm value --set sriov.install=true to install the SR-IOV components; see the sketch below.

    • If you are a user from China, you can specify the parameter --set global.imageRegistryOverride=ghcr.m.daocloud.io to avoid image pull failures.
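
    A minimal install command might look like the following (a sketch based on the Helm invocation used in the shared-mode section; adjust the values to your environment):

    helm upgrade --install spiderpool spiderpool/spiderpool --namespace kube-system --reuse-values \
       --set sriov.install=true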

    Once the installation is complete, the following components will be installed:

    ~# kubectl get pod -n kube-system
    spiderpool-agent-9sllh                         1/1     Running     0          1m
    spiderpool-agent-h92bv                         1/1     Running     0          1m
    spiderpool-controller-7df784cdb7-bsfwv         1/1     Running     0          1m
    spiderpool-sriov-operator-65b59cd75d-89wtg     1/1     Running     0          1m
    spiderpool-init                                0/1     Completed   0          1m

  3. Configure SR-IOV operator

    To enable the SR-IOV CNI on specific nodes, label those nodes with the following command so that the sriov-network-operator installs its components on them.

    kubectl label node $NodeName node-role.kubernetes.io/worker=""
    

    Look up the device information of the RoCE interface. Enter the following command; in this example the NIC vendor is 15b3 and the device ID is 1017:

    ~# lspci -nn | grep Ethernet
    af:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    af:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    

    Note that the number of VFs determines how many SR-IOV network interfaces can be provided to Pods on a host. NICs from different manufacturers have different VF limits; for example, the Mellanox ConnectX-5 used in this example can create up to 127 VFs.
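
    To check the maximum number of VFs supported by the NIC, you can read the interface's sysfs attribute (using the interface name from this example):

    ~# cat /sys/class/net/ens6f0np0/device/sriov_totalvfs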

    Apply the following configuration, and the VFs will be created on the host. Note that this may cause the nodes to reboot in order for the new configuration to take effect in the network card driver.

    cat <<EOF | kubectl apply -f -
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: roce-sriov
      namespace: kube-system
    spec:
      nodeSelector:
        kubernetes.io/os: "linux"
      resourceName: mellanoxroce
      priority: 99
      numVfs: 12
      nicSelector:
          deviceID: "1017"
          rootDevices:
          - 0000:af:00.0
          vendor: "15b3"
      deviceType: netdevice
      isRdma: true
    EOF
    

    Verify the available resources on the node, including the reported SR-IOV device resources:

    ~# kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]'
    [
      {
        "name": "10-20-1-10",
        "allocable": {
          "cpu": "40",
          "pods": "110",
          "spidernet.io/mellanoxroce": "12",
          ...
        }
      },
      ...
    ]
    
  4. Create an SR-IOV CNI configuration and the corresponding IP pool resources:

    cat <<EOF | kubectl apply -f -
    apiVersion: spiderpool.spidernet.io/v2beta1
    kind: SpiderIPPool
    metadata:
      name: v4-81
    spec:
      gateway: 172.81.0.1
      ips:
      - 172.81.0.100-172.81.0.120
      subnet: 172.81.0.0/16
    ---
    apiVersion: spiderpool.spidernet.io/v2beta1
    kind: SpiderMultusConfig
    metadata:
      name: roce-sriov
      namespace: kube-system
    spec:
      cniType: sriov
      sriov:
        resourceName: spidernet.io/mellanoxroce
        enableRdma: true
        ippools:
          ipv4: ["v4-81"]
    EOF
    
  5. Using the configuration from the previous step, create a DaemonSet application that spans the nodes for testing:

    ANNOTATION_MULTUS="v1.multus-cni.io/default-network: kube-system/roce-sriov"
    RESOURCE="spidernet.io/mellanoxroce"
    NAME=rdma-sriov
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: ${NAME}
      labels:
        app: $NAME
    spec:
      selector:
        matchLabels:
          app: $NAME
      template:
        metadata:
          name: $NAME
          labels:
            app: $NAME
          annotations:
            ${ANNOTATION_MULTUS}
        spec:
          containers:
          - image: docker.io/mellanox/rping-test
            imagePullPolicy: IfNotPresent
            name: mofed-test
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                ${RESOURCE}: 1
            command:
            - sh
            - -c
            - |
              ls -l /dev/infiniband /sys/class/net
              sleep 1000000
    EOF
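
    As before, you can list the DaemonSet's Pods and open a shell in one of them for the verification in the next step (replace the Pod name with one from your cluster):

    ~# kubectl get pod -o wide -l app=rdma-sriov
    ~# kubectl exec -it <pod-name> -- sh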
    
  6. Verify that RDMA communication works correctly between Pods across nodes.

    Open a terminal and access one Pod to launch a service:

    # Only the RDMA device allocated to the Pod can be found
    ~# rdma link
    7/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev eth0
    
    # launch an RDMA service
    ~# ib_read_lat
    

    Open a terminal and access another Pod to launch a service:

    # Only the RDMA device allocated to the Pod can be found
    ~# rdma link
    10/1: mlx5_5/1: state ACTIVE physical_state LINK_UP netdev eth0
    
    # Access the service running in the other Pod
    ~# ib_read_lat 172.81.0.118
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_4'.
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_2'.
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_0'.
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_3'.
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_1'.
    ---------------------------------------------------------------------------------------
                        RDMA_Read Latency Test
     Dual-port       : OFF    Device         : mlx5_5
     Number of qps   : 1    Transport type : IB
     Connection type : RC   Using SRQ      : OFF
     TX depth        : 1
     Mtu             : 1024[B]
     Link type       : Ethernet
     GID index       : 2
     Outstand reads  : 16
     rdma_cm QPs   : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0000 QPN 0x0b69 PSN 0xd476c2 OUT 0x10 RKey 0x006f00 VAddr 0x00000001f91000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:105
     remote address: LID 0000 QPN 0x0d69 PSN 0xbe5c89 OUT 0x10 RKey 0x004f00 VAddr 0x0000000160d000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:118
    ---------------------------------------------------------------------------------------
     #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
    Conflicting CPU frequency values detected: 2200.000000 != 1338.151000. CPU Frequency is not max.
    Conflicting CPU frequency values detected: 2200.000000 != 2881.668000. CPU Frequency is not max.
     2       1000          6.66           20.37        6.74              6.82         0.78      7.15        20.37
    ---------------------------------------------------------------------------------------