EgressGateway Failover
Controller Failover
The number of EgressGateway Controller replicas is set with the controller.replicas parameter during installation. When multiple replicas are running and one of them fails, the system automatically elects another replica as the primary controller to ensure service continuity.
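As a minimal sketch, and assuming the dotted parameter name maps to nested keys in the Helm values file in the usual way, running more than one replica could look like the fragment below. The value 2 is an illustrative choice, not a recommendation from the project.

```yaml
# values.yaml fragment (illustrative)
controller:
  # Number of Controller replicas; 2 is an example value chosen here
  replicas: 2
```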
Datapath Failover
For datapath failover, an EgressGateway can use nodeSelector to select a set of nodes as Egress Nodes. The Egress IP is bound to one of these nodes. When that node fails, or the Egress Agent on it fails, the Egress IP automatically moves to another available node to ensure service continuity and reliability.
apiVersion: egressgateway.spidernet.io/v1beta1
kind: EgressGateway
metadata: 
  name: egw1
spec:
  clusterDefault: true
  ippools:
    ipv4:
      - 10.6.1.55
      - 10.6.1.56 
    ipv4DefaultEIP: 10.6.1.56
    ipv6:
      - fd00::55
      - fd00::56
    ipv6DefaultEIP: fd00::55
  nodeSelector:
    selector:
      matchLabels:
        egress: "true"
status:
  nodeList:
    - name: node1
      status: Ready
      eips:
        - ipv4: 10.6.1.56
          ipv6: fd00::55
          policies:
            - name: policy1
              namespace: default
    - name: node2
      status: Ready
In the EgressGateway definition above, the nodeSelector matches the label egress: "true", which designates two nodes, node1 and node2, as Egress Nodes. node1 is selected as the active node, and its effective Egress IP can be viewed under status. If node1 fails, node2 takes over as the failover node.
The timeout for health checks and Egress IP failover can be tuned via Helm values configuration.
- feature.tunnelMonitorPeriod: The interval, in seconds, at which the egress controller checks the last update status of each tunnel. Default: 5.
- feature.tunnelUpdatePeriod: The interval, in seconds, at which the egress agent updates the tunnel status. Default: 5.
- feature.eipEvictionTimeout: If the last update time of an egress tunnel is older than this timeout, in seconds, the Egress IP of that node is moved to an available node. Default: 5.
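Assuming the dotted names map to nested keys in the chart's values file in the usual Helm way, the three parameters could be set as in the following sketch. The values shown are the defaults listed above.

```yaml
# values.yaml fragment (sketch; key nesting inferred from the dotted names)
feature:
  # Controller checks the tunnel heartbeats every 5 seconds
  tunnelMonitorPeriod: 5
  # Agent refreshes its tunnel heartbeat every 5 seconds
  tunnelUpdatePeriod: 5
  # Egress IP is moved once the heartbeat is older than this many seconds
  eipEvictionTimeout: 5
```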
apiVersion: egressgateway.spidernet.io/v1beta1
kind: EgressTunnel
metadata:
  name: workstation1
spec: {}
status:
  lastHeartbeatTime: "2023-11-27T12:04:56Z"
  mark: "0x26d9b723"
  phase: Ready
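The EgressGateway Agent periodically updates the status.lastHeartbeatTime field at the interval set by feature.tunnelUpdatePeriod. The EgressGateway Controller, in turn, lists all EgressTunnels at the interval set by feature.tunnelMonitorPeriod and compares status.lastHeartbeatTime plus feature.eipEvictionTimeout with the current time. If the current time is later than that sum, the heartbeat is considered timed out. For example, with the lastHeartbeatTime shown above (2023-11-27T12:04:56Z) and an eipEvictionTimeout of 5 seconds, the tunnel is treated as timed out once a check runs after 2023-11-27T12:05:01Z without a newer heartbeat.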
Datapath Failover troubleshooting steps:
- First, check the installation configuration file values.yaml of the EgressGateway application to ensure that the failover-related settings are reasonable. In particular, make sure eipEvictionTimeout is greater than the sum of tunnelMonitorPeriod and tunnelUpdatePeriod.
- Execute kubectl get egt -w to check the status of the EgressTunnel resources. Check whether the selected node is in the HeartbeatTimeout state and whether other EgressTunnels are in the Ready state.
- To check whether an IP switch was caused by HeartbeatTimeout, search the logs of the controller container for entries related to update tunnel status to HeartbeatTimeout.
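As an illustration of the first check, the following values fragment satisfies the rule that eipEvictionTimeout be greater than the sum of tunnelMonitorPeriod and tunnelUpdatePeriod. The numbers are example values chosen here, and the key nesting is assumed from the dotted parameter names.

```yaml
# values.yaml fragment (illustrative values, not project recommendations)
feature:
  tunnelMonitorPeriod: 5
  tunnelUpdatePeriod: 5
  # 15 > 5 + 5, which satisfies the rule in the first step
  eipEvictionTimeout: 15
```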