| .. SPDX-FileCopyrightText: 2021 Open Networking Foundation <info@opennetworking.org> |
| .. SPDX-License-Identifier: Apache-2.0 |
| |
| .. _troubleshooting_guide: |
| |
| Troubleshooting Guide |
| ===================== |
| |
| This section provides hints and useful commands to help you troubleshoot traffic-related problems |
| or Kubernetes (k8s) issues. Keep in mind that these two types of issues are closely related, because both |
| control plane software and data plane software are containerized and deployed as Kubernetes services in SD-Fabric. |
| Please refer to :ref:`architecture_design` for further details. |
| |
| ONL troubleshooting |
| ------------------- |
| |
| Can't reboot into ONL, loops on ONIE installer mode |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Sometimes an ONL installation is incomplete or problematic, and reinstalling it |
| doesn't result in a working system. |
| |
| If this is the case, reboot into ONIE Rescue mode and use ``parted`` to delete |
| all the ``ONL-`` prefixed partitions, then reinstall with an ``onie-installer`` |
| image. |
| |
| K8s troubleshooting |
| ------------------- |
| |
| We assume that ``kubectl`` is already installed on your local machine. |
| The first step is to set up the proper ``kubeconfig`` file to access the k8s cluster you want to troubleshoot: |
| |
| .. code-block:: |
| |
| $ export KUBECONFIG=~/kubeconfig/dev-sdfabric-menlo |
| $ kubectl config use-context dev-sdfabric-menlo |
| Switched to context "dev-sdfabric-menlo". |
| |
| You can list the k8s namespaces with the ``kubectl get`` command: |
| |
| .. code-block:: |
| |
| $ kubectl get namespaces |
| ... |
| kube-node-lease Active 68d |
| kube-public Active 68d |
| kube-system Active 68d |
| security-scan Active 68d |
| sdfabric Active 26h |
| |
| Let's assume that SD-Fabric resources are deployed under the namespace ``sdfabric``, so make sure that the ``sdfabric`` |
| namespace has been properly created (additionally other namespaces could be created - please check your overarching chart). |
| |
| If the deployment is not successful, |
| a first check is to make sure there are enough available nodes in the target cluster. |
| You can check the available nodes with the ``kubectl get nodes`` command: |
| |
| .. code-block:: |
| |
| $ kubectl get nodes -o wide |
| NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME |
| compute1 Ready controlplane,etcd,worker 39d v1.18.8 10.76.28.74 <none> Ubuntu 18.04.6 LTS 5.4.0-73-generic docker://20.10.9 |
| compute2 Ready controlplane,etcd,worker 39d v1.18.8 10.76.28.72 <none> Ubuntu 18.04.5 LTS 5.4.0-73-generic docker://19.3.15 |
| compute3 Ready controlplane,etcd,worker 39d v1.18.8 10.76.28.68 <none> Ubuntu 18.04.5 LTS 5.4.0-73-generic docker://19.3.15 |
| leaf1 Ready worker 39d v1.18.8 10.76.28.70 <none> Debian GNU/Linux 9 (stretch) 4.14.49-OpenNetworkLinux docker://19.3.15 |
| leaf2 Ready worker 39d v1.18.8 10.76.28.71 <none> Debian GNU/Linux 9 (stretch) 4.14.49-OpenNetworkLinux docker://19.3.15 |
| |
| You should have at least `3+N` available nodes, where N depends on the deployed network topology. Please note that ONOS |
| cannot be scheduled on the network devices (these are special worker nodes), and different ONOS instances cannot share the |
| same worker node (the same applies to Atomix). |
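
The `3+N` check can be scripted. The following is a minimal sketch, not part of SD-Fabric: the function names ``count_ready_nodes`` and ``has_enough_nodes`` are hypothetical, and the value of N must come from your own topology. It assumes the default column layout of ``kubectl get nodes --no-headers``, where the second column is the node status.

```shell
#!/bin/sh
# Count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Cordoned nodes ("Ready,SchedulingDisabled") are deliberately not counted.
count_ready_nodes() {
    awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Succeed when at least 3 + N nodes are Ready.
# $1 = N (hypothetical parameter; its value depends on your topology).
has_enough_nodes() {
    required=$((3 + $1))
    ready=$(count_ready_nodes)
    echo "ready=$ready required=$required"
    [ "$ready" -ge "$required" ]
}

# Usage against a live cluster (not executed here):
#   kubectl get nodes --no-headers | has_enough_nodes 2
```

The usage line assumes N = 2, which is only a guess based on the two leaf nodes in the sample listing above.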
| |
| Some basic pods are present in every deployment. |
| You can list the pods with ``kubectl get pods -n sdfabric``: |
| |
| .. code-block:: |
| |
| $ kubectl get pods -n sdfabric -o wide |
| NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES |
| onos-tost-atomix-0 1/1 Running 0 6h31m 10.72.106.161 compute3 <none> <none> |
| onos-tost-atomix-1 1/1 Running 0 6h31m 10.72.111.229 compute1 <none> <none> |
| onos-tost-atomix-2 1/1 Running 0 6h31m 10.72.75.254 compute2 <none> <none> |
| onos-tost-onos-classic-0 1/1 Running 0 98m 10.72.106.133 compute3 <none> <none> |
| onos-tost-onos-classic-1 1/1 Running 0 6h31m 10.72.111.207 compute1 <none> <none> |
| onos-tost-onos-classic-2 1/1 Running 0 6h31m 10.72.75.247 compute2 <none> <none> |
| onos-tost-onos-classic-onos-config-loader-ddc9d68bb-lq97t 1/1 Running 0 6h19m 10.72.106.190 compute3 <none> <none> |
| stratum-bwlvh 1/1 Running 0 6h31m 10.76.28.70 leaf1 <none> <none> |
| stratum-gh842 1/1 Running 0 6h31m 10.76.28.71 leaf2 <none> <none> |
| |
| 3 Atomix nodes and 3 ONOS nodes are needed for HA. `onos-config-loader` is equally important, because without it ONOS |
| cannot be properly configured. The number of Stratum pods depends on the deployed topology. If the status of the pods |
| is not `Running`, you can check the events published by k8s components to get a first idea of what is happening: |
| |
| .. code-block:: |
| |
| $ kubectl get events -n sdfabric --sort-by='.lastTimestamp' |
| LAST SEEN TYPE REASON OBJECT MESSAGE |
| 12m Normal Scheduled pod/telegraf-75b959574d-sl8qb Successfully assigned tost/telegraf-75b959574d-sl8qb to compute3 |
| 12m Normal SuccessfulCreate replicaset/telegraf-75b959574d Created pod: telegraf-75b959574d-sl8qb |
| 12m Normal ScalingReplicaSet deployment/telegraf Scaled up replica set telegraf-75b959574d to 1 |
| 12m Normal Pulled pod/telegraf-75b959574d-sl8qb Container image "telegraf:1.17" already present on machine |
| 12m Normal AddedInterface pod/telegraf-75b959574d-sl8qb Add eth0 [10.72.106.153/32] |
| 12m Normal Started pod/telegraf-75b959574d-sl8qb Started container telegraf |
| 12m Normal Created pod/telegraf-75b959574d-sl8qb Created container telegraf |
| ... |
| |
| The option ``--sort-by='.lastTimestamp'`` sorts the events by time. The previous command |
| reports all the events that happened in the ``sdfabric`` namespace. If you want more insight into a specific |
| pod, use the command ``kubectl describe pods``: |
| |
| .. code-block:: |
| |
| $ kubectl describe pods -n sdfabric onos-tost-onos-classic-0 |
| Name: onos-tost-onos-classic-0 |
| Namespace: sdfabric |
| Priority: 0 |
| Node: compute3/10.76.28.68 |
| Start Time: Mon, 11 Oct 2021 10:35:43 +0200 |
| ... |
| Events: |
| Type Reason Age From Message |
| ... |
| {"message":"pending"} |
| org.onosproject.segmentrouting is not yet ready |
| |
| The ``Events`` section typically provides useful information about the issues the pod is facing. |
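
In a busy namespace, ``Normal`` events can hide the ones that point to a problem. The sketch below keeps only ``Warning`` events. The helper name ``warning_events`` is hypothetical, and it assumes the default column layout of ``kubectl get events --no-headers``, where the second column is the event type.

```shell
#!/bin/sh
# Keep only events whose TYPE column is "Warning".
# Reads `kubectl get events --no-headers` output on stdin.
warning_events() {
    awk '$2 == "Warning"'
}

# Usage against a live cluster (not executed here):
#   kubectl get events -n sdfabric --sort-by='.lastTimestamp' --no-headers | warning_events
# kubectl can also filter on the server side:
#   kubectl get events -n sdfabric --field-selector type=Warning
```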
| |
| Both ONOS and Atomix define readiness probes, which make sure that the pods are ready before any configuration |
| takes place. As a consequence, if the probes fail for a given pod, the output of the command |
| ``kubectl get pods`` shows ``0/1`` next to its name under the ``READY`` column. We report in `ONOS pod not ready (1)`_ and |
| `ONOS pod not ready (2)`_ two scenarios frequently faced by the SD-Fabric developers. |
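
To spot pods with failing readiness probes without scanning the table by eye, the ``READY`` column can be parsed. This is a minimal sketch with a hypothetical helper name, ``list_unready_pods``. It assumes the default ``kubectl get pods --no-headers`` layout (name, ready/total, status) and skips ``Completed`` pods, which legitimately report ``0/1``.

```shell
#!/bin/sh
# Print "<pod name> <ready/total>" for every pod that has fewer ready
# containers than total containers, ignoring pods in Completed status.
# Reads `kubectl get pods --no-headers` output on stdin.
list_unready_pods() {
    awk '$3 != "Completed" {
        split($2, c, "/")                      # c[1] = ready, c[2] = total
        if (c[1] + 0 < c[2] + 0) print $1, $2
    }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pods -n sdfabric --no-headers | list_unready_pods
```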
| |
| Logs of the SD-Fabric pods can be accessed with the ``kubectl logs`` command: |
| |
| .. code-block:: |
| |
| $ kubectl -n sdfabric logs onos-tost-onos-classic-0 |
| 2021-10-12 04:46:17,955 INFO [EventAdminConfigurationNotifier] Sending Event Admin notification (configuration successful) to org/ops4j/pax/logging/Configuration |
| ... |
| 2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Changes to perform: |
| 2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Region: root |
| 2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Bundles to install: |
| |
| |
| ONOS Troubleshooting |
| -------------------- |
| |
| You can get the ONOS CLI by establishing an SSH connection to port ``8101`` (the default password is `karaf`): |
| |
| .. code-block:: |
| |
| $ kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101 |
| # In another terminal (or run the port-forward in the background and send its output to /dev/null) |
| $ ssh -p 8101 karaf@localhost |
| The authenticity of host '[localhost]:8101 ([127.0.0.1]:8101)' can't be established. |
| RSA key fingerprint is SHA256:Mlaax9tHmIR6WwK0B3okC1O4mpAuoXjI7Z5+KKelxOo. |
| Are you sure you want to continue connecting (yes/no)? yes |
| Warning: Permanently added '[localhost]:8101' (RSA) to the list of known hosts. |
| Password authentication |
| Password: |
| Welcome to Open Network Operating System (ONOS)! |
| ____ _ ______ ____ |
| / __ \/ |/ / __ \/ __/ |
| / /_/ / / /_/ /\ \ |
| \____/_/|_/\____/___/ |
| |
| Documentation: wiki.onosproject.org |
| Tutorials: tutorials.onosproject.org |
| Mailing lists: lists.onosproject.org |
| |
| Come help out! Find out how at: contribute.onosproject.org |
| |
| Hit '<tab>' for a list of available commands |
| and '[cmd] --help' for help on a specific command. |
| Hit '<ctrl-d>' or type 'logout' to exit ONOS session. |
| |
| karaf@root > |
| |
| Alternatively, if it is not possible to establish an SSH connection with the ONOS pods, |
| you can use the ``kubectl exec`` command on the target pod: |
| |
| .. code-block:: |
| |
| $ kubectl -n sdfabric exec -it onos-tost-onos-classic-0 -- bash apache-karaf-4.2.14/bin/client |
| Welcome to Open Network Operating System (ONOS)! |
| ____ _ ______ ____ |
| / __ \/ |/ / __ \/ __/ |
| / /_/ / / /_/ /\ \ |
| \____/_/|_/\____/___/ |
| |
| Documentation: wiki.onosproject.org |
| Tutorials: tutorials.onosproject.org |
| Mailing lists: lists.onosproject.org |
| |
| Come help out! Find out how at: contribute.onosproject.org |
| |
| Hit '<tab>' for a list of available commands |
| and '[cmd] --help' for help on a specific command. |
| Hit '<ctrl-d>' or type 'logout' to exit ONOS session. |
| |
| karaf@root |
| |
| You can attach to the ONOS logs by using the ``log:tail`` command: |
| |
| .. code-block:: |
| |
| karaf@root > log:tail |
| 20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine1 -> device:leaf1 |
| 20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine2 -> device:leaf1 |
| 20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf1 -> device:spine1 |
| 20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf2 -> device:spine1 |
| |
| The command continuously displays the log entries, which is useful for a live debugging session. |
| Complete ONOS logs can be accessed with the ``kubectl logs`` command, as explained in the previous section. |
| If the logs do not reveal the problem, you can inspect |
| the ONOS state by issuing specific CLI commands. The section `Frequently Used Commands`_ lists a few commands we frequently use |
| when troubleshooting SD-Fabric. |
| |
| Pipeline Walk-through |
| ^^^^^^^^^^^^^^^^^^^^^ |
| .. note:: |
| More information on the pipeline walk-through is coming soon. |
| |
| onos-diagnostics |
| ^^^^^^^^^^^^^^^^ |
| |
| If you can't figure out what is going wrong, you can seek help on the SD-Fabric developer mailing list |
| ``sdfabric-dev@opennetworking.org`` or reach out on the ``sdfabric-dev`` Slack channel. There are a few |
| things we would like you to attach: |
| |
| - **Issue description** |
| |
| - **Environment description**, such as SD-Fabric version, switch model and SDE version |
| |
| - **Steps to reproduce**, in as much detail as possible |
| |
| - **Diagnostics**. |
| |
| We have built a tool `onos-diagnostics-k8s <https://wiki.onosproject.org/display/ONOS/ONOS+Remote+Admin+Tools>`_ |
| to help you easily collect and package ONOS diagnostics. The tool collects various information from the running |
| ONOS cluster and packages it into one, easy-to-share archive file. This tool is distributed as part of the ONOS |
| software itself (under bin directory), but is also available as part of a small archive of remote tools to administer |
| an ONOS cluster (`onos-admin-\*.tar.gz`). |
| |
| Alternatively, you can use ``onos-diagnostics-k8s`` in Kubernetes-enabled environments. The tool produces |
| the same results as ``onos-diagnostics`` and relies only on ``kubectl`` commands. The tool needs to know the name of |
| the namespace, which can be provided through the option ``-s``. Then, you have to provide the names of the target |
| pods. To avoid having to specify these names as part of the command, you can export the ``ONOS_PODS`` environment |
| variable. Here is an example of how to set the variable: |
| |
| .. code-block:: |
| |
| $ export ONOS_PODS="onos-0 onos-1 onos-2" |
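
Rather than typing the pod names by hand, the list can be derived from the cluster. The following is a minimal sketch under stated assumptions: the helper ``onos_pod_names`` is hypothetical, and the ``onos-classic-<index>`` suffix is taken from the sample pod listing earlier in this guide, so adjust the pattern if your release uses different names.

```shell
#!/bin/sh
# Turn a list of pod names (one per line, optionally prefixed with "pod/",
# as printed by `kubectl get pods -o name`) into a space-separated list of
# the pods whose name ends in "onos-classic-<index>".
onos_pod_names() {
    awk '{ sub(/^pod\//, "") }
         /onos-classic-[0-9]+$/ { printf "%s%s", sep, $0; sep = " " }
         END { print "" }'
}

# Usage against a live cluster (not executed here):
#   export ONOS_PODS="$(kubectl -n sdfabric get pods -o name | onos_pod_names)"
```

The trailing ``$`` in the pattern keeps out pods such as the config loader, whose name contains ``onos-classic`` but does not end in an index.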
| |
| The tool needs to know the Karaf home (path from the mount point). To avoid having to specify this path as part |
| of the command, you can export the ``KARAF_HOME`` environment variable: |
| |
| .. code-block:: |
| |
| $ export KARAF_HOME="apache-karaf-4.2.14" |
| |
| Once done, the ``onos-diagnostics-k8s`` tool can be run as follows: |
| |
| .. code-block:: |
| |
| $ onos-diagnostics-k8s -s sdfabric |
| |
| The ``-n`` option names the resulting archive file, which helps differentiate between |
| cluster instances, e.g. |
| |
| .. code-block:: |
| |
| # This will produce archive file /tmp/delta-pod-diags.tar.gz |
| $ onos-diagnostics-k8s -s sdfabric -n delta-pod |
| |
| By default ``onos-diagnostics-k8s`` uses ``ONOS_PROFILE`` to collect the diagnostics. You can tailor the behavior of the |
| command to your needs by specifying a different `profile <https://github.com/opennetworkinglab/onos/blob/master/tools/package/runtime/bin/onos-diagnostics-profile>`_. |
| For SD-Fabric we suggest using ``TRELLIS_PROFILE``. The resulting `/tmp/\*-diags.tar.gz` file will contain all |
| relevant information about the ONOS cluster. |
| |
| The following is an example of a complete ``onos-diagnostics-k8s`` command: |
| |
| .. code-block:: |
| |
| $ DIAGS_PROFILE=TRELLIS_PROFILE onos-diagnostics-k8s -k apache-karaf-4.2.14 -s sdfabric onos-tost-onos-classic-0 onos-tost-onos-classic-1 onos-tost-onos-classic-2 |
| |
| UP4 Troubleshooting |
| ------------------- |
| |
| .. note:: |
| More information on UP4 troubleshooting is coming soon. |
| |
| Common Issues |
| ------------- |
| |
| .. note:: |
| Here is a list of common issues. |
| More details on each case are coming soon. |
| |
| ImagePullBackOff |
| ^^^^^^^^^^^^^^^^ |
| |
| ONOS pod not ready (1) |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| ONOS pod not ready (2) |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| ONOS pods not configured |
| ^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Packet-In not working |
| ^^^^^^^^^^^^^^^^^^^^^ |
| |
| Device offline |
| ^^^^^^^^^^^^^^ |
| |
| Frequently Used Commands |
| ------------------------ |
| |
| This subsection introduces a few commands we frequently use when troubleshooting SD-Fabric. |
| |
| ONOS |
| ^^^^ |
| To execute the following ONOS CLI commands: |
| |
| - Create a K8s port forward with ``kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101`` |
| - Log in to the ONOS CLI with ``ssh -p 8101 karaf@localhost``. The default password is ``karaf`` |
| |
| ONOS basics |
| """"""""""" |
| - ``flows``: List flow tables. `-s` for simplified output. |
| - ``groups``: List group tables. `-s` for simplified output. |
| - ``devices``: List device information. `-s` for simplified output. |
| - ``ports``: List port information. `-e` to list enabled ports only. |
| - ``links``: List discovered links |
| - ``hosts``: List discovered hosts. `-s` for simplified output. |
| - ``netcfg``: List network configuration |
| - ``interfaces``: List interface configuration |
| |
| trellis-control |
| """"""""""""""" |
| - ``sr-pr-list``: List current recovery phase of each device |
| - ``sr-device-subnets``: List device-subnet mapping |
| |
| fabric-tna |
| """""""""" |
| - ``slices``: List network slices |
| - ``tcs``: List traffic classes of given slice |
| |
| up4 |
| """ |
| - ``read-entities -a``: Print UPF entities installed in the UPF dataplane. |
| More options are available. See ``read-entities --help`` |
| |
| |
| Stratum |
| ^^^^^^^ |
| To execute the following BF Shell commands: |
| |
| - Log in to the Stratum switch with ``ssh root@<switch_ip>``. The default password is ``onl`` |
| - Attach to the Stratum Docker container with ``docker attach $(docker ps | grep stratum-bfrt | awk '{print $1}')`` |
| |
| - Hit `enter` for the prompt |
| - Use `<Ctrl-P><Ctrl-Q>` to exit the container. Do not use `<Ctrl-C>` since it will terminate the process. |
| |
| BF Shell |
| """""""" |
| - ``pm.show``: List port configurations. `-a` to list all ports. |