.. _troubleshooting_guide:

Troubleshooting Guide
=====================

This section provides hints and useful commands to help you troubleshoot traffic-related problems
or Kubernetes (k8s) issues. These two types of issues are closely related, because both the
control plane software and the data plane software are containerized and deployed as Kubernetes services in SD-Fabric.
Please refer to :ref:`architecture_design` for further details.

K8s troubleshooting
-------------------

We assume that ``kubectl`` has already been installed on your local machine.
The first step is to set up the proper ``kubeconfig`` file to access the k8s cluster you want to troubleshoot:

.. code-block::

    $ export KUBECONFIG=~/kubeconfig/dev-sdfabric-menlo
    $ kubectl config use-context dev-sdfabric-menlo
    Switched to context "dev-sdfabric-menlo".

You can list the k8s namespaces with the ``kubectl get`` command:

.. code-block::

    $ kubectl get namespaces
    ...
    kube-node-lease   Active   68d
    kube-public       Active   68d
    kube-system       Active   68d
    security-scan     Active   68d
    sdfabric          Active   26h

This guide assumes that the SD-Fabric resources are deployed under the ``sdfabric`` namespace, so make sure that
this namespace has been properly created (other namespaces may be created as well; please check your overarching chart).

If the deployment is not successful,
first make sure that there are enough available nodes in the target cluster.
You can check the available nodes with the ``kubectl get nodes`` command:

.. code-block::

    $ kubectl get nodes -o wide
    NAME       STATUS   ROLES                      AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION             CONTAINER-RUNTIME
    compute1   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.74   <none>        Ubuntu 18.04.6 LTS             5.4.0-73-generic           docker://20.10.9
    compute2   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.72   <none>        Ubuntu 18.04.5 LTS             5.4.0-73-generic           docker://19.3.15
    compute3   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.68   <none>        Ubuntu 18.04.5 LTS             5.4.0-73-generic           docker://19.3.15
    leaf1      Ready    worker                     39d   v1.18.8   10.76.28.70   <none>        Debian GNU/Linux 9 (stretch)   4.14.49-OpenNetworkLinux   docker://19.3.15
    leaf2      Ready    worker                     39d   v1.18.8   10.76.28.71   <none>        Debian GNU/Linux 9 (stretch)   4.14.49-OpenNetworkLinux   docker://19.3.15

You should have at least `3+N` available nodes, where N depends on the deployed network topology. Please note that ONOS
cannot be scheduled on the network devices (these are special worker nodes), and that different ONOS instances cannot share
the same worker node (the same applies to Atomix).

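The node-count check can be scripted. The following is a minimal sketch and not part of SD-Fabric: the function names ``count_ready_nodes`` and ``have_enough_nodes`` are hypothetical, and the sketch assumes the default column layout of ``kubectl get nodes``, where the second column is ``STATUS``.

```shell
# Hypothetical helper (not shipped with SD-Fabric): count the nodes whose STATUS
# is exactly "Ready". Cordoned nodes ("Ready,SchedulingDisabled") are not counted.
# Assumes the default column layout of `kubectl get nodes` (NAME, STATUS, ...).
count_ready_nodes() {
    kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Hypothetical helper: succeed only if at least $1 nodes are Ready.
have_enough_nodes() {
    local required="$1"
    local ready
    ready="$(count_ready_nodes)"
    echo "Ready nodes: ${ready} (required: ${required})"
    [ "${ready}" -ge "${required}" ]
}

# Example: 3 nodes for ONOS/Atomix plus N=2 switches.
# have_enough_nodes 5
```
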
Some basic containers should be present in every deployment.
You can list the pods with ``kubectl get pods -n sdfabric``:

.. code-block::

    $ kubectl get pods -n sdfabric -o wide
    NAME                                                        READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES
    onos-tost-atomix-0                                          1/1     Running   0          6h31m   10.72.106.161   compute3   <none>           <none>
    onos-tost-atomix-1                                          1/1     Running   0          6h31m   10.72.111.229   compute1   <none>           <none>
    onos-tost-atomix-2                                          1/1     Running   0          6h31m   10.72.75.254    compute2   <none>           <none>
    onos-tost-onos-classic-0                                    1/1     Running   0          98m     10.72.106.133   compute3   <none>           <none>
    onos-tost-onos-classic-1                                    1/1     Running   0          6h31m   10.72.111.207   compute1   <none>           <none>
    onos-tost-onos-classic-2                                    1/1     Running   0          6h31m   10.72.75.247    compute2   <none>           <none>
    onos-tost-onos-classic-onos-config-loader-ddc9d68bb-lq97t   1/1     Running   0          6h19m   10.72.106.190   compute3   <none>           <none>
    stratum-bwlvh                                               1/1     Running   0          6h31m   10.76.28.70     leaf1      <none>           <none>
    stratum-gh842                                               1/1     Running   0          6h31m   10.76.28.71     leaf2      <none>           <none>

Three Atomix nodes and three ONOS nodes are needed for HA. `onos-config-loader` is equally important, because without it ONOS
cannot be properly configured. The number of Stratum pods depends on the deployed topology. If the status of a pod
is not `Running`, you can check the events published by the k8s components to get a first idea of what is happening:

.. code-block::

    $ kubectl get events -n sdfabric --sort-by='.lastTimestamp'
    LAST SEEN   TYPE     REASON              OBJECT                           MESSAGE
    12m         Normal   Scheduled           pod/telegraf-75b959574d-sl8qb    Successfully assigned tost/telegraf-75b959574d-sl8qb to compute3
    12m         Normal   SuccessfulCreate    replicaset/telegraf-75b959574d   Created pod: telegraf-75b959574d-sl8qb
    12m         Normal   ScalingReplicaSet   deployment/telegraf              Scaled up replica set telegraf-75b959574d to 1
    12m         Normal   Pulled              pod/telegraf-75b959574d-sl8qb    Container image "telegraf:1.17" already present on machine
    12m         Normal   AddedInterface      pod/telegraf-75b959574d-sl8qb    Add eth0 [10.72.106.153/32]
    12m         Normal   Started             pod/telegraf-75b959574d-sl8qb    Started container telegraf
    12m         Normal   Created             pod/telegraf-75b959574d-sl8qb    Created container telegraf
    ...

The option ``--sort-by='.lastTimestamp'`` sorts the events by time. The previous command
reports all the events that happened in the ``sdfabric`` namespace. If you want more insight into a specific
pod, you can use the ``kubectl describe pods`` command:

.. code-block::

    $ kubectl describe pods -n sdfabric onos-tost-onos-classic-0
    Name:         onos-tost-onos-classic-0
    Namespace:    sdfabric
    Priority:     0
    Node:         compute3/10.76.28.68
    Start Time:   Mon, 11 Oct 2021 10:35:43 +0200
    ...
    Events:
      Type  Reason  Age  From  Message
      ...
      {"message":"pending"}
      org.onosproject.segmentrouting is not yet ready

The ``Events`` section typically provides useful information about the issues the pod is facing.

Both ONOS and Atomix define readiness probes, which make sure that the pods are ready before any configuration
takes place. As a consequence, if the probes fail for a given pod, the output of
``kubectl get pods`` shows ``0/1`` in the ``READY`` column next to its name. We report in `ONOS pod not ready (1)`_ and
`ONOS pod not ready (2)`_ two scenarios frequently faced by the SD-Fabric developers.

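To spot pods with failing readiness probes quickly, the ``READY`` column can be filtered. This is a minimal sketch under stated assumptions: ``list_not_ready_pods`` is a hypothetical helper name, and the sketch relies on the default column layout of ``kubectl get pods`` (``NAME`` first, ``READY`` second).

```shell
# Hypothetical helper: print the names of the pods in a namespace whose READY
# column shows fewer ready containers than expected (for example 0/1).
# Assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, ...).
# Note: pods of completed jobs also show 0/1 and will be listed.
list_not_ready_pods() {
    local namespace="${1:-sdfabric}"
    kubectl get pods -n "${namespace}" --no-headers |
        awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Example:
# list_not_ready_pods sdfabric
```
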
Logs of the SD-Fabric pods can be accessed with the ``kubectl logs`` command:

.. code-block::

    $ kubectl -n sdfabric logs onos-tost-onos-classic-0
    2021-10-12 04:46:17,955 INFO [EventAdminConfigurationNotifier] Sending Event Admin notification (configuration successful) to org/ops4j/pax/logging/Configuration
    ...
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Changes to perform:
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Region: root
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Bundles to install:

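When the log is long, it can help to look only at warnings and errors. The sketch below is an illustration, not an SD-Fabric tool: ``pod_warnings`` is a hypothetical helper, and it assumes that the log level appears as a space-delimited word, as in the sample output above. The ``--tail`` option of ``kubectl logs`` limits the number of lines read; ``--previous`` can be added by hand to read the log of a container that has restarted.

```shell
# Hypothetical helper: show only WARN and ERROR entries from the last N log
# lines of a pod in the sdfabric namespace (N defaults to 1000).
# Assumes the level is a space-delimited word, e.g. "... 04:46:18,991 WARN [...".
pod_warnings() {
    local pod="$1"
    local lines="${2:-1000}"
    kubectl -n sdfabric logs --tail="${lines}" "${pod}" | grep -E ' (WARN|ERROR) '
}

# Example:
# pod_warnings onos-tost-onos-classic-0 500
```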

ONOS Troubleshooting
--------------------

You can get the ONOS CLI by establishing an SSH connection to port ``8101`` (the default password is `karaf`):

.. code-block::

    $ kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101
    # In another terminal (or run the port-forward in the background and redirect its output to /dev/null)
    $ ssh -p 8101 karaf@localhost
    The authenticity of host '[localhost]:8101 ([127.0.0.1]:8101)' can't be established.
    RSA key fingerprint is SHA256:Mlaax9tHmIR6WwK0B3okC1O4mpAuoXjI7Z5+KKelxOo.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added '[localhost]:8101' (RSA) to the list of known hosts.
    Password authentication
    Password:
    Welcome to Open Network Operating System (ONOS)!
         ____  _  ______  ____
        / __ \/ |/ / __ \/ __/
       / /_/ /    / /_/ /\ \
       \____/_/|_/\____/___/

    Documentation: wiki.onosproject.org
    Tutorials:     tutorials.onosproject.org
    Mailing lists: lists.onosproject.org

    Come help out! Find out how at: contribute.onosproject.org

    Hit '<tab>' for a list of available commands
    and '[cmd] --help' for help on a specific command.
    Hit '<ctrl-d>' or type 'logout' to exit ONOS session.

    karaf@root >

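If you prefer a single terminal, the port-forward can run in the background for the duration of the SSH session. The following is a minimal sketch: ``onos_cli`` is a hypothetical helper name, the pod name default is taken from the listings in this guide, and the fixed ``sleep`` is a crude assumption about how long the tunnel needs to come up.

```shell
# Hypothetical helper: start the port-forward in the background with its output
# sent to /dev/null, open the ONOS CLI over SSH, and stop the port-forward
# when the SSH session ends.
onos_cli() {
    local pod="${1:-onos-tost-onos-classic-0}"
    kubectl -n sdfabric port-forward "${pod}" 8101 >/dev/null 2>&1 &
    local pf_pid=$!
    sleep 1   # crude wait for the tunnel to come up; increase if SSH is refused
    ssh -p 8101 karaf@localhost
    local status=$?
    kill "${pf_pid}" 2>/dev/null
    return "${status}"
}

# Example:
# onos_cli onos-tost-onos-classic-1
```
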
Alternatively, if it is not possible to establish an SSH connection with the ONOS pods,
you can use the ``kubectl exec`` command on the target pod:

.. code-block::

    $ kubectl -n sdfabric exec -it onos-tost-onos-classic-0 -- bash apache-karaf-4.2.9/bin/client
    Welcome to Open Network Operating System (ONOS)!
         ____  _  ______  ____
        / __ \/ |/ / __ \/ __/
       / /_/ /    / /_/ /\ \
       \____/_/|_/\____/___/

    Documentation: wiki.onosproject.org
    Tutorials:     tutorials.onosproject.org
    Mailing lists: lists.onosproject.org

    Come help out! Find out how at: contribute.onosproject.org

    Hit '<tab>' for a list of available commands
    and '[cmd] --help' for help on a specific command.
    Hit '<ctrl-d>' or type 'logout' to exit ONOS session.

    karaf@root >

You can attach to the ONOS logs by using the ``log:tail`` command:

.. code-block::

    karaf@root > log:tail
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine1 -> device:leaf1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine2 -> device:leaf1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf1 -> device:spine1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf2 -> device:spine1

The command continuously displays new log entries, which is useful for a live debugging session.
Complete ONOS logs can be accessed with the ``kubectl logs`` command, as explained in the previous section.
If nothing can be figured out from the logs, you can inspect
the ONOS state by issuing specific CLI commands. The section `Frequently Used Commands`_ lists a few commands we frequently use
when troubleshooting SD-Fabric.

Pipeline Walk-through
^^^^^^^^^^^^^^^^^^^^^

.. note::
    More information about the pipeline walk-through is coming soon

onos-diagnostics
^^^^^^^^^^^^^^^^

If you can't figure out what is going wrong, you can seek help on the SD-Fabric developer mailing list
``sdfabric-dev@opennetworking.org`` or reach out on the ``sdfabric-dev`` Slack channel. There are a few
things we would like you to attach:

- **Issue description**

- **Environment description**, such as SD-Fabric version, switch model and SDE version

- **Steps to reproduce**, as detailed as possible

- **Diagnostics**

We have built a tool, `onos-diagnostics <https://wiki.onosproject.org/display/ONOS/ONOS+Remote+Admin+Tools>`_,
to help you easily collect and package ONOS diagnostics. The tool collects various information from the running
ONOS cluster and packages it into a single, easy-to-share archive file. This tool is distributed as part of the ONOS
software itself (under the bin directory), but is also available as part of a small archive of remote tools to administer
an ONOS cluster (`onos-admin-\*.tar.gz`).

Alternatively, it is possible to use ``onos-diagnostics-k8s`` in Kubernetes-enabled environments. The tool produces
the same results as onos-diagnostics and relies only on ``kubectl`` commands. The tool needs to know the name of
the namespace, which can be provided through the option ``-s``. You also have to provide the names of the target
pods. To avoid having to specify these names as part of the command, you can export the ``ONOS_PODS`` environment
variable. Here is an example of how to set the variable:

.. code-block::

    $ export ONOS_PODS="onos-0 onos-1 onos-2"

The tool needs to know the Karaf home (path from the mount point). To avoid having to specify this path as part
of the command, you can export the ``KARAF_HOME`` environment variable:

.. code-block::

    $ export KARAF_HOME="apache-karaf-4.2.9"

Once done, the ``onos-diagnostics-k8s`` tool can be run as follows:

.. code-block::

    $ onos-diagnostics-k8s -s sdfabric

The ``-n`` option lets you name the resulting archive file, which helps to differentiate between
cluster instances, e.g.:

.. code-block::

    # This will produce archive file /tmp/delta-pod-diags.tar.gz
    $ onos-diagnostics-k8s -s sdfabric -n delta-pod

By default, ``onos-diagnostics-k8s`` uses ``ONOS_PROFILE`` to collect the diagnostics. You can tailor the behavior of the
command to your needs by specifying a different `profile <https://github.com/opennetworkinglab/onos/blob/master/tools/package/runtime/bin/onos-diagnostics-profile>`_.
For SD-Fabric we suggest using ``TRELLIS_PROFILE``. The resulting `/tmp/\*-diags.tar.gz` file will contain all
relevant information about the ONOS cluster.

The following is an example of a complete ``onos-diagnostics-k8s`` command:

.. code-block::

    $ DIAGS_PROFILE=TRELLIS_PROFILE onos-diagnostics-k8s -k apache-karaf-4.2.9 -s sdfabric onos-tost-onos-classic-0 onos-tost-onos-classic-1 onos-tost-onos-classic-2

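The pod names for ``ONOS_PODS`` do not have to be typed by hand. The following is a minimal sketch, not part of the ONOS tooling: ``onos_pod_names`` is a hypothetical helper, and it assumes that the ONOS pods are named ``<release>-onos-classic-<index>``, as in the pod listings in this guide.

```shell
# Hypothetical helper: print the names of the ONOS pods in a namespace,
# separated by spaces. Assumes pod names end in "onos-classic-<index>",
# which excludes the onos-config-loader pod.
onos_pod_names() {
    local namespace="${1:-sdfabric}"
    kubectl get pods -n "${namespace}" -o name |
        sed 's|^pod/||' |
        grep -E 'onos-classic-[0-9]+$' |
        tr '\n' ' ' |
        sed 's/ $//'
}

# Usage sketch:
# export ONOS_PODS="$(onos_pod_names sdfabric)"
# export KARAF_HOME="apache-karaf-4.2.9"
# DIAGS_PROFILE=TRELLIS_PROFILE onos-diagnostics-k8s -s sdfabric
```
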
UP4 Troubleshooting
-------------------

.. note::
    More information about UP4 troubleshooting is coming soon

Common Issues
-------------

.. note::
    Here is a list of common issues.
    More details about each case are coming soon

ImagePullBackOff
^^^^^^^^^^^^^^^^

ONOS pod not ready (1)
^^^^^^^^^^^^^^^^^^^^^^

ONOS pod not ready (2)
^^^^^^^^^^^^^^^^^^^^^^

ONOS pods not configured
^^^^^^^^^^^^^^^^^^^^^^^^

Packet-In not working
^^^^^^^^^^^^^^^^^^^^^

Device offline
^^^^^^^^^^^^^^

Frequently Used Commands
------------------------

This subsection introduces a few commands we frequently use when troubleshooting SD-Fabric.

ONOS
^^^^
To execute the following ONOS CLI commands:

- Create K8s port forwarding by `kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101`
- Log in to the ONOS CLI by `ssh -p 8101 karaf@localhost`. Default password is `karaf`

ONOS basics
"""""""""""
- `flows`: List flow tables. `-s` for simplified output.
- `groups`: List group tables. `-s` for simplified output.
- `devices`: List device information. `-s` for simplified output.
- `ports`: List port information. `-e` to list enabled ports only.
- `links`: List discovered links
- `hosts`: List discovered hosts. `-s` for simplified output.
- `netcfg`: List network configuration
- `interfaces`: List interface configuration

trellis-control
"""""""""""""""
- `sr-pr-list`: List current recovery phase of each device
- `sr-device-subnets`: List device-subnet mapping

fabric-tna
""""""""""
- `slices`: List network slices
- `tcs`: List traffic classes of given slice

up4
"""
- `read-interfaces`: List all interfaces installed in the data plane
- `read-pdrs`: List all PDRs installed in the data plane
- `read-fars`: List all FARs installed in the data plane
- `read-flows`: List all UE data flows installed in the data plane

Stratum
^^^^^^^
To execute the following BF Shell commands:

- Log in to the Stratum switch by `ssh root@<switch_ip>`. Default password is `onl`
- Attach to the Stratum docker container by `docker attach \`docker ps | grep stratum-bfrt | awk \'{print $1}\'\``

  - Hit `enter` for the prompt
  - Use `<Ctrl-P><Ctrl-Q>` to exit the container. Do not use `<Ctrl-C>` since it will terminate the process.

BF Shell
""""""""
- `pm.show`: List port configurations. `-a` to list all ports.