RDMA

Introduction

Spiderpool uses the macvlan, ipvlan, and SR-IOV CNIs to expose the host machine's RDMA NICs to Pods. This page provides an overview of how to use RDMA NICs with Spiderpool.

Features

The RDMA subsystem supports two network namespace modes: shared and exclusive. Containers can therefore either share RDMA NICs or access them exclusively. In Kubernetes, shared NICs are used with the macvlan or ipvlan CNI, while exclusive NICs are used with the SR-IOV CNI.

In shared mode, Spiderpool uses the macvlan or ipvlan CNI to expose the host machine's RoCE NICs to Pods, and the RDMA shared device plugin to report RDMA NIC resources and schedule Pods onto them.

In exclusive mode, Spiderpool uses the SR-IOV CNI to expose the host machine's RDMA NICs to Pods, giving them access to RDMA resources, and the RDMA CNI to ensure that RDMA devices are isolated.

Shared usage of RoCE-capable NIC with macvlan or ipvlan

The following steps demonstrate how to enable shared usage of RDMA devices by Pods in a cluster with two nodes via macvlan CNI:

  1. Ensure that the host machine has an RDMA NIC installed and that its driver is properly installed so that RDMA functions correctly.

    In our demo environment, the host machine is equipped with a Mellanox ConnectX-5 NIC with RoCE capabilities. Follow the official NVIDIA guide to install the latest OFED driver. To confirm the presence of RDMA devices, use the following command:

    ~# rdma link show
    link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens6f0np0
    link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev ens6f1np1
    

    Make sure that the RDMA subsystem on the host is operating in shared mode. If not, switch to shared mode.

    ~# rdma system
    netns shared copy-on-fork on

    # switch to shared mode
    ~# rdma system set netns shared
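
    Note that rdma system set does not persist across reboots. As a sketch, assuming the ib_core module parameter netns_mode (where 1 selects shared mode, the default), shared mode can be persisted in the same way that exclusive mode is persisted later on this page:

    # Persist shared mode across host reboots (assumption: netns_mode=1 selects shared mode)
    ~# echo "options ib_core netns_mode=1" >> /etc/modprobe.d/ib_core.conf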
  2. Verify the details of the RDMA card for subsequent device resource discovery by the device plugin.

    Run the following command; in this example, the NIC vendor is 15b3 and its device ID is 1017:

    ~# lspci -nn | grep Ethernet
    af:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    af:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
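
    As a quick cross-check (a sketch that uses the interface names reported by rdma link show above), you can map each RDMA netdev to its PCI address and confirm it matches the lspci output:

    # Resolve the PCI address behind the netdev (e.g. 0000:af:00.0)
    ~# readlink /sys/class/net/ens6f0np0/device
    # Show the vendor and device IDs for that PCI address
    ~# lspci -nn -s af:00.0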
    
  3. Install Spiderpool with the RDMA shared device plugin enabled:

    helm install spiderpool spiderpool/spiderpool -n kube-system \
       --set multus.multusCNI.defaultCniCRName="macvlan-ens6f0np0" \
       --set rdma.rdmaSharedDevicePlugin.install=true \
       --set rdma.rdmaSharedDevicePlugin.deviceConfig.resourcePrefix="spidernet.io" \
       --set rdma.rdmaSharedDevicePlugin.deviceConfig.resourceName="hca_shared_devices" \
       --set rdma.rdmaSharedDevicePlugin.deviceConfig.rdmaHcaMax=500 \
       --set rdma.rdmaSharedDevicePlugin.deviceConfig.vendors="15b3" \
       --set rdma.rdmaSharedDevicePlugin.deviceConfig.deviceIDs="1017"
    
    • If the macvlan CNI is not installed in your cluster, you can specify the Helm parameter --set plugins.installCNI=true to install it.

    • If you are a user from China, you can specify the parameter --set global.imageRegistryOverride=ghcr.m.daocloud.io to avoid image pull failures from Spiderpool.

    • After completing the installation of Spiderpool, you can manually edit the spiderpool-rdma-shared-device-plugin configmap to reconfigure the RDMA shared device plugin (a sketch of this is shown at the end of this step).

    • Specify the name of the NetworkAttachmentDefinition instance for the default CNI used by Multus via multus.multusCNI.defaultCniCRName. If this option is provided, an empty NetworkAttachmentDefinition instance is generated automatically during installation. Otherwise, Multus attempts to create a NetworkAttachmentDefinition instance from the first CNI configuration found in the /etc/cni/net.d directory; if no suitable configuration is found, a NetworkAttachmentDefinition instance named default is created so that the Multus installation can complete.

    Once the installation is complete, the following components will be installed:

    ~# kubectl get pod -n kube-system
    spiderpool-agent-9sllh                         1/1     Running     0          1m
    spiderpool-agent-h92bv                         1/1     Running     0          1m
    spiderpool-controller-7df784cdb7-bsfwv         1/1     Running     0          1m
    spiderpool-init                                0/1     Completed   0          1m
    spiderpool-rdma-shared-device-plugin-dr7w8     1/1     Running     0          1m
    spiderpool-rdma-shared-device-plugin-zj65g     1/1     Running     0          1m
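
    Here is the sketch referenced in the note above (it assumes the device plugin runs as a DaemonSet named spiderpool-rdma-shared-device-plugin, matching the Pod names listed here): edit the configmap, then restart the plugin so the new configuration takes effect.

    # Adjust resourceName, vendors, deviceIDs, etc.
    ~# kubectl edit configmap spiderpool-rdma-shared-device-plugin -n kube-system
    # Restart the device plugin Pods to reload the configuration
    ~# kubectl rollout restart daemonset spiderpool-rdma-shared-device-plugin -n kube-system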
    
  4. View the available resources on a node, including the reported RDMA device resources:

    ~# kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]'
      [
        {
          "name": "10-20-1-10",
          "allocable": {
            "cpu": "40",
            "memory": "263518036Ki",
            "pods": "110",
            "spidernet.io/hca_shared_devices": "500",
            ...
          }
        },
        ...
      ]
    

    If the reported resource count is 0, it may be due to the following reasons:

    (1) Verify that the vendors and deviceID in the spiderpool-rdma-shared-device-plugin configmap match the actual values.

    (2) Check the logs of the rdma-shared-device-plugin. If you encounter errors about RDMA NIC support, try installing rdma-core on the host machine (apt-get install rdma-core or dnf install rdma-core).

    error creating new device: "missing RDMA device spec for device 0000:04:00.0, RDMA device \"issm\" not found"
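
    A minimal cross-check for (1), assuming the selectors are stored in the spiderpool-rdma-shared-device-plugin configmap populated by the Helm values above:

    # The vendors/deviceIDs listed here must match the lspci output from the earlier step
    ~# kubectl get configmap spiderpool-rdma-shared-device-plugin -n kube-system -o yaml | grep -E "vendors|deviceIDs"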

  5. Create a macvlan-related Multus configuration that uses the RDMA NIC as its master interface, and create the corresponding IPPool resource:

    cat <<EOF | kubectl apply -f -
    apiVersion: spiderpool.spidernet.io/v2beta1
    kind: SpiderIPPool
    metadata:
      name: v4-81
    spec:
      gateway: 172.81.0.1
      ips:
      - 172.81.0.100-172.81.0.120
      subnet: 172.81.0.0/16
    ---
    apiVersion: spiderpool.spidernet.io/v2beta1
    kind: SpiderMultusConfig
    metadata:
      name: macvlan-ens6f0np0
      namespace: kube-system
    spec:
      cniType: macvlan
      macvlan:
        master:
        - "ens6f0np0"
        ippools:
          ipv4: ["v4-81"]
    EOF
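
    As a quick sanity check (a sketch; it assumes Spiderpool generates a NetworkAttachmentDefinition with the same name and namespace as the SpiderMultusConfig), confirm that the resources exist:

    ~# kubectl get spiderippool v4-81
    ~# kubectl get spidermultusconfig -n kube-system macvlan-ens6f0np0
    ~# kubectl get network-attachment-definitions.k8s.cni.cncf.io -n kube-system macvlan-ens6f0np0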
    
  6. Following the configurations from the previous step, create a DaemonSet application that runs across the nodes:

    ANNOTATION_MULTUS="v1.multus-cni.io/default-network: kube-system/macvlan-ens6f0np0"
    RESOURCE="spidernet.io/hca_shared_devices"
    NAME=rdma-macvlan
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: ${NAME}
      labels:
        app: $NAME
    spec:
      selector:
        matchLabels:
          app: $NAME
      template:
        metadata:
          name: $NAME
          labels:
            app: $NAME
          annotations:
            ${ANNOTATION_MULTUS}
        spec:
          containers:
          - image: docker.io/mellanox/rping-test
            imagePullPolicy: IfNotPresent
            name: mofed-test
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                ${RESOURCE}: 1
            command:
            - sh
            - -c
            - |
              ls -l /dev/infiniband /sys/class/net
              sleep 1000000
    EOF
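
    Before running the verification below, you can locate the Pods and open a shell in them (a sketch; the app label comes from the manifest above, and sh is the shell already used by its command):

    # List the Pods and their IPs (allocated from the v4-81 pool)
    ~# kubectl get pod -l app=rdma-macvlan -o wide
    # Open a shell in one of the Pods
    ~# kubectl exec -it <pod-name> -- sh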
    
  7. Verify that RDMA data transmission is working correctly between the Pods across nodes.

    Open a terminal and access one Pod to launch a service:

    # You are able to see all the RDMA cards on the host machine
    ~# rdma link
    0/1: mlx5_0/1: state ACTIVE physical_state LINK_UP
    1/1: mlx5_1/1: state ACTIVE physical_state LINK_UP
    
    # Start an RDMA service
    ~# ib_read_lat
    

    Open a terminal and access another Pod to launch a service:

    # You are able to see all the RDMA cards on the host machine
    ~# rdma link
    0/1: mlx5_0/1: state ACTIVE physical_state LINK_UP
    1/1: mlx5_1/1: state ACTIVE physical_state LINK_UP
    
    # Access the service running in the other Pod
    ~# ib_read_lat 172.81.0.120
    ---------------------------------------------------------------------------------------
                        RDMA_Read Latency Test
     Dual-port       : OFF    Device         : mlx5_0
     Number of qps   : 1    Transport type : IB
     Connection type : RC   Using SRQ      : OFF
     TX depth        : 1
     Mtu             : 1024[B]
     Link type       : Ethernet
     GID index       : 12
     Outstand reads  : 16
     rdma_cm QPs   : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0000 QPN 0x0107 PSN 0x79dd10 OUT 0x10 RKey 0x1fddbc VAddr 0x000000023bd000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:119
     remote address: LID 0000 QPN 0x0107 PSN 0x40001a OUT 0x10 RKey 0x1fddbc VAddr 0x00000000bf9000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:120
    ---------------------------------------------------------------------------------------
     #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
    Conflicting CPU frequency values detected: 2200.000000 != 1040.353000. CPU Frequency is not max.
    Conflicting CPU frequency values detected: 2200.000000 != 1849.351000. CPU Frequency is not max.
     2       1000          6.88           16.81        7.04              7.06         0.31      7.38        16.81
    ---------------------------------------------------------------------------------------
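
    Beyond latency, you can measure RDMA bandwidth in the same way, assuming the test image also ships ib_write_bw from the same perftest suite:

    # In the first Pod: start the bandwidth server
    ~# ib_write_bw
    # In the second Pod: run the client against the first Pod's IP
    ~# ib_write_bw 172.81.0.120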
    

Isolated usage of RoCE-capable NIC with SR-IOV

The following steps demonstrate how to enable isolated usage of RDMA devices by Pods in a cluster with two nodes via SR-IOV CNI:

  1. Ensure that the host machine has an RDMA-capable, SR-IOV-enabled NIC and that its driver is properly installed so that RDMA functions correctly.

    In our demo environment, the host machine is equipped with a Mellanox ConnectX-5 NIC with RoCE capabilities. Follow the official NVIDIA guide to install the latest OFED driver.

    To isolate the usage of an RDMA network card, ensure that at least one of the following conditions is met:

    (1) A kernel based on 5.3.0 or newer, with the RDMA modules loaded. The rdma-core package provides a means to automatically load the relevant modules at system start.

    (2) Mellanox OFED version 4.7 or newer. In this case, a kernel based on 5.3.0 or newer is not required.

    To confirm the presence of RDMA devices, use the following command:

    ~# rdma link show
    link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens6f0np0
    link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev ens6f1np1
    

    Make sure that the RDMA subsystem on the host is operating in exclusive mode. If not, switch to exclusive mode.

    # Switch to exclusive mode (this setting is lost when the host reboots)
    ~# rdma system set netns exclusive
    # To make the setting persistent across reboots:
    ~# echo "options ib_core netns_mode=0" >> /etc/modprobe.d/ib_core.conf
    
    ~# rdma system
    netns exclusive copy-on-fork on
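
    After a reboot, you can confirm that the persistent setting took effect (a sketch; it assumes the ib_core netns_mode parameter is exposed under /sys/module):

    # 0 means exclusive mode
    ~# cat /sys/module/ib_core/parameters/netns_mode
    ~# rdma system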
    

    (Optional) In an SR-IOV scenario, applications can use NVIDIA's GPUDirect RDMA feature. For instructions on installing the required kernel module, refer to the official documentation.

  2. Verify the details of the RDMA card for subsequent device resource discovery by the device plugin.

    Run the following command; in this example, the NIC vendor is 15b3 and its device ID is 1017:

    ~# lspci -nn | grep Ethernet
    af:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    af:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
    
  3. Install Spiderpool

    helm install spiderpool spiderpool/spiderpool -n kube-system \
       --set sriov.install=true  \
       --set plugins.installRdmaCNI=true
    
    • If you are a user from China, you can specify the parameter --set global.imageRegistryOverride=ghcr.m.daocloud.io to avoid image pull failures from Spiderpool.

    • If you also enabled the RDMA shared device plugin during installation, you can manually edit the spiderpool-rdma-shared-device-plugin configmap afterwards to reconfigure it.

    • Specify the name of the NetworkAttachmentDefinition instance for the default CNI used by Multus via multus.multusCNI.defaultCniCRName. If this option is provided, an empty NetworkAttachmentDefinition instance is generated automatically during installation. Otherwise, Multus attempts to create a NetworkAttachmentDefinition instance from the first CNI configuration found in the /etc/cni/net.d directory; if no suitable configuration is found, a NetworkAttachmentDefinition instance named default is created so that the Multus installation can complete.

    Once the installation is complete, the following components will be installed:

    ~# kubectl get pod -n kube-system
    spiderpool-agent-9sllh                         1/1     Running     0          1m
    spiderpool-agent-h92bv                         1/1     Running     0          1m
    spiderpool-controller-7df784cdb7-bsfwv         1/1     Running     0          1m
    spiderpool-init                                0/1     Completed   0          1m
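
    You can also confirm that the SR-IOV operator components are up before moving on (a sketch; component names vary with the chart version, but they typically contain "sriov"):

    ~# kubectl get pod -n kube-system | grep -i sriov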

  4. Configure the SR-IOV operator.

    The number of VFs determines how many SR-IOV network interfaces can be provided to Pods on a host. NICs from different vendors have different VF limits; for example, the Mellanox ConnectX-5 used in this example supports up to 127 VFs.

    Apply the following configuration to create the VFs on the host. Note that this may cause the nodes to reboot so that the new NIC driver configuration can take effect.

    cat <<EOF | kubectl apply -f -
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policyrdma
      namespace: kube-system
    spec:
      nodeSelector:
        kubernetes.io/os: "linux"
      resourceName: mellanoxrdma
      priority: 99
      numVfs: 12
      nicSelector:
          deviceID: "1017"
          rootDevices:
          - 0000:af:00.0
          vendor: "15b3"
      deviceType: netdevice
      isRdma: true
    EOF
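
    While the policy is being applied, you can watch the node state reported by the operator and check the VF count on the host (a sketch; SriovNetworkNodeState resources are created by the sriov-network-operator, and sriov_numvfs is the standard kernel sysfs entry):

    # syncStatus should eventually report Succeeded
    ~# kubectl get sriovnetworknodestates -n kube-system -o yaml | grep syncStatus
    # On the node, the PF should now expose the requested 12 VFs
    ~# cat /sys/class/net/ens6f0np0/device/sriov_numvfs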
    

    Verify the available resources on the node, including the reported SR-IOV device resources:

    ~# kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]'
    [
      {
        "name": "10-20-1-10",
        "allocable": {
          "cpu": "40",
          "pods": "110",
          "spidernet.io/mellanoxrdma": "12",
          ...
        }
      },
      ...
    ]
    
  5. Create an SR-IOV-related Multus configuration and the corresponding IPPool resource:

    cat <<EOF | kubectl apply -f -
    apiVersion: spiderpool.spidernet.io/v2beta1
    kind: SpiderIPPool
    metadata:
      name: v4-81
    spec:
      gateway: 172.81.0.1
      ips:
      - 172.81.0.100-172.81.0.120
      subnet: 172.81.0.0/16
    ---
    apiVersion: spiderpool.spidernet.io/v2beta1
    kind: SpiderMultusConfig
    metadata:
      name: sriov-rdma
      namespace: kube-system
    spec:
      cniType: sriov
      sriov:
        resourceName: spidernet.io/mellanoxrdma
        enableRdma: true
        ippools:
          ipv4: ["v4-81"]
    EOF
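
    As with the macvlan case, you can confirm that the corresponding NetworkAttachmentDefinition was generated and carries the resource name (a sketch; it assumes the generated object shares the SpiderMultusConfig's name and namespace):

    ~# kubectl get network-attachment-definitions.k8s.cni.cncf.io -n kube-system sriov-rdma -o yaml | grep -i resourcename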
    
  6. Following the configurations from the previous step, create a DaemonSet application that runs across the nodes:

    ANNOTATION_MULTUS="v1.multus-cni.io/default-network: kube-system/sriov-rdma"
    RESOURCE="spidernet.io/mellanoxrdma"
    NAME=rdma-sriov
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: ${NAME}
      labels:
        app: $NAME
    spec:
      selector:
        matchLabels:
          app: $NAME
      template:
        metadata:
          name: $NAME
          labels:
            app: $NAME
          annotations:
            ${ANNOTATION_MULTUS}
        spec:
          containers:
          - image: docker.io/mellanox/rping-test
            imagePullPolicy: IfNotPresent
            name: mofed-test
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                ${RESOURCE}: 1
            command:
            - sh
            - -c
            - |
              ls -l /dev/infiniband /sys/class/net
              sleep 1000000
    EOF
    
  7. Verify that RDMA data transmission is working correctly between the Pods across nodes.

    Open a terminal and access one Pod to launch a service:

    # Only one RDMA device allocated to the Pod can be found
    ~# rdma link
    7/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev eth0
    
    # launch an RDMA service
    ~# ib_read_lat
    

    Open a terminal and access another Pod to launch a service:

    # Only the RDMA device allocated to the Pod can be found
    ~# rdma link
    10/1: mlx5_5/1: state ACTIVE physical_state LINK_UP netdev eth0
    
    # Access the service running in the other Pod
    ~# ib_read_lat 172.81.0.118
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_4'.
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_2'.
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_0'.
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_3'.
    libibverbs: Warning: couldn't stat '/sys/class/infiniband/mlx5_1'.
    ---------------------------------------------------------------------------------------
                        RDMA_Read Latency Test
     Dual-port       : OFF    Device         : mlx5_5
     Number of qps   : 1    Transport type : IB
     Connection type : RC   Using SRQ      : OFF
     TX depth        : 1
     Mtu             : 1024[B]
     Link type       : Ethernet
     GID index       : 2
     Outstand reads  : 16
     rdma_cm QPs   : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0000 QPN 0x0b69 PSN 0xd476c2 OUT 0x10 RKey 0x006f00 VAddr 0x00000001f91000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:105
     remote address: LID 0000 QPN 0x0d69 PSN 0xbe5c89 OUT 0x10 RKey 0x004f00 VAddr 0x0000000160d000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:81:00:118
    ---------------------------------------------------------------------------------------
     #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
    Conflicting CPU frequency values detected: 2200.000000 != 1338.151000. CPU Frequency is not max.
    Conflicting CPU frequency values detected: 2200.000000 != 2881.668000. CPU Frequency is not max.
     2       1000          6.66           20.37        6.74              6.82         0.78      7.15        20.37
    ---------------------------------------------------------------------------------------