.. _troubleshooting_guide:
Troubleshooting Guide
=====================
This section provides hints and useful commands to help you troubleshoot traffic-related problems
and Kubernetes (k8s) issues. Keep in mind that these two types of issues are closely related, because both the
control plane and the data plane software are containerized and deployed as Kubernetes services in SD-Fabric.
Please refer to :ref:`architecture_design` for further details.
K8s troubleshooting
-------------------
We assume that the ``kubectl`` tool has already been installed on your local machine.
The first step is to set up the proper ``kubeconfig`` file to access the k8s cluster you want to troubleshoot:
.. code-block::
$ export KUBECONFIG=~/kubeconfig/dev-sdfabric-menlo
$ kubectl config use-context dev-sdfabric-menlo
Switched to context "dev-sdfabric-menlo".
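Before going further, it can help to confirm that ``kubectl`` is really pointing at the intended cluster.
The following is a minimal sketch: the helper name ``check_context`` is hypothetical and not part of SD-Fabric,
and it relies only on the standard ``kubectl config current-context`` command.

.. code-block:: shell

    # Hypothetical helper (not part of SD-Fabric): compare the active kubectl
    # context with the one you expect to be using.
    check_context() {
        expected="$1"
        current="$(kubectl config current-context)"
        if [ "$current" = "$expected" ]; then
            echo "ok: using context $current"
        else
            echo "warning: current context is '$current', expected '$expected'" >&2
            return 1
        fi
    }

    # Example: check_context dev-sdfabric-menlo

The function returns a non-zero status on a mismatch, so it can guard the other commands in a script.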
You can get the list of the k8s namespaces using the ``kubectl get`` command:
.. code-block::
$ kubectl get namespaces
...
kube-node-lease Active 68d
kube-public Active 68d
kube-system Active 68d
security-scan Active 68d
sdfabric Active 26h
We assume that the SD-Fabric resources are deployed in the ``sdfabric`` namespace, so make sure that this
namespace has been properly created (other namespaces may be created as well; please check your overarching chart).
If the deployment is not successful,
first make sure there are enough available nodes in the target cluster.
You can list the nodes with the ``kubectl get nodes`` command:
.. code-block::
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
compute1 Ready controlplane,etcd,worker 39d v1.18.8 10.76.28.74 <none> Ubuntu 18.04.6 LTS 5.4.0-73-generic docker://20.10.9
compute2 Ready controlplane,etcd,worker 39d v1.18.8 10.76.28.72 <none> Ubuntu 18.04.5 LTS 5.4.0-73-generic docker://19.3.15
compute3 Ready controlplane,etcd,worker 39d v1.18.8 10.76.28.68 <none> Ubuntu 18.04.5 LTS 5.4.0-73-generic docker://19.3.15
leaf1 Ready worker 39d v1.18.8 10.76.28.70 <none> Debian GNU/Linux 9 (stretch) 4.14.49-OpenNetworkLinux docker://19.3.15
leaf2 Ready worker 39d v1.18.8 10.76.28.71 <none> Debian GNU/Linux 9 (stretch) 4.14.49-OpenNetworkLinux docker://19.3.15
You should have at least `3+N` available nodes, where N depends on the deployed network topology. Please note that ONOS
cannot be scheduled on the network devices (these are special worker nodes), and that different ONOS instances cannot
share the same worker node (the same applies to Atomix).
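The node count can be checked quickly from the command line. The sketch below assumes the default
``kubectl get nodes --no-headers`` column layout, in which the second column is ``STATUS``. The helper name
``count_ready_nodes`` is hypothetical. It counts every node whose status is exactly ``Ready``, switches included,
so compare the result against `3+N` for your topology.

.. code-block:: shell

    # Hypothetical helper: count the nodes whose STATUS column is exactly "Ready".
    # Nodes reported as "NotReady" or "Ready,SchedulingDisabled" are not counted.
    count_ready_nodes() {
        awk '$2 == "Ready" { n++ } END { print n + 0 }'
    }

    # Usage against a live cluster:
    #   kubectl get nodes --no-headers | count_ready_nodes

Reading the node list from standard input keeps the helper independent of the cluster, so it can also be run on
saved output.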
Every deployment should include at least a basic set of pods.
You can list the pods with ``kubectl get pods -n sdfabric``:
.. code-block::
$ kubectl get pods -n sdfabric -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
onos-tost-atomix-0 1/1 Running 0 6h31m 10.72.106.161 compute3 <none> <none>
onos-tost-atomix-1 1/1 Running 0 6h31m 10.72.111.229 compute1 <none> <none>
onos-tost-atomix-2 1/1 Running 0 6h31m 10.72.75.254 compute2 <none> <none>
onos-tost-onos-classic-0 1/1 Running 0 98m 10.72.106.133 compute3 <none> <none>
onos-tost-onos-classic-1 1/1 Running 0 6h31m 10.72.111.207 compute1 <none> <none>
onos-tost-onos-classic-2 1/1 Running 0 6h31m 10.72.75.247 compute2 <none> <none>
onos-tost-onos-classic-onos-config-loader-ddc9d68bb-lq97t 1/1 Running 0 6h19m 10.72.106.190 compute3 <none> <none>
stratum-bwlvh 1/1 Running 0 6h31m 10.76.28.70 leaf1 <none> <none>
stratum-gh842 1/1 Running 0 6h31m 10.76.28.71 leaf2 <none> <none>
3 Atomix nodes and 3 ONOS nodes are needed for HA. `onos-config-loader` is equally important, because without it ONOS
cannot be properly configured. The number of Stratum pods depends on the deployed topology. If the status of the pods
is not `Running`, you can check the events published by the k8s components to get a first idea of what is happening:
.. code-block::
$ kubectl get events -n sdfabric --sort-by='.lastTimestamp'
LAST SEEN TYPE REASON OBJECT MESSAGE
12m Normal Scheduled pod/telegraf-75b959574d-sl8qb Successfully assigned tost/telegraf-75b959574d-sl8qb to compute3
12m Normal SuccessfulCreate replicaset/telegraf-75b959574d Created pod: telegraf-75b959574d-sl8qb
12m Normal ScalingReplicaSet deployment/telegraf Scaled up replica set telegraf-75b959574d to 1
12m Normal Pulled pod/telegraf-75b959574d-sl8qb Container image "telegraf:1.17" already present on machine
12m Normal AddedInterface pod/telegraf-75b959574d-sl8qb Add eth0 [10.72.106.153/32]
12m Normal Started pod/telegraf-75b959574d-sl8qb Started container telegraf
12m Normal Created pod/telegraf-75b959574d-sl8qb Created container telegraf
...
The option ``--sort-by='.lastTimestamp'`` is typically used to get the events sorted by time. The previous command
reports all the events that happened in the ``sdfabric`` namespace. For more insight into a specific
pod, use the command ``kubectl describe pods``:
.. code-block::
$ kubectl describe pods -n sdfabric onos-tost-onos-classic-0
Name: onos-tost-onos-classic-0
Namespace: sdfabric
Priority: 0
Node: compute3/10.76.28.68
Start Time: Mon, 11 Oct 2021 10:35:43 +0200
...
Events:
Type Reason Age From Message
...
{"message":"pending"}
org.onosproject.segmentrouting is not yet ready
The ``Events`` section typically provides useful information about the issues the pod is facing.
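When a namespace produces many events, it can be quicker to look only at the warnings. The sketch below uses the
standard ``--field-selector`` option of ``kubectl get events`` with the ``type`` and ``involvedObject.name`` fields.
The helper name ``warning_events`` is hypothetical.

.. code-block:: shell

    # Hypothetical helper: show only the Warning events of a namespace, sorted
    # by time. An optional second argument restricts the output to one pod.
    warning_events() {
        namespace="$1"
        pod="${2:-}"
        selector="type=Warning"
        if [ -n "$pod" ]; then
            selector="$selector,involvedObject.name=$pod"
        fi
        kubectl get events -n "$namespace" \
            --field-selector "$selector" \
            --sort-by='.lastTimestamp'
    }

    # Examples:
    #   warning_events sdfabric
    #   warning_events sdfabric onos-tost-onos-classic-0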
Both ONOS and Atomix define readiness probes, which make sure that the pods are ready before any configuration
takes place. As a consequence, if the probes fail for a given pod, the output of the command
``kubectl get pods`` shows ``0/1`` in the ``READY`` column next to its name. We describe in `ONOS pod not ready (1)`_ and
`ONOS pod not ready (2)`_ two scenarios frequently faced by the SD-Fabric developers.
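Pods with failing readiness probes can be picked out of a long pod list automatically. The sketch below assumes
the default ``kubectl get pods --no-headers`` layout, in which the second column is ``READY`` in the form
``<ready>/<total>``. The helper name ``not_ready_pods`` is hypothetical.

.. code-block:: shell

    # Hypothetical helper: print the name of every pod whose READY column shows
    # fewer ready containers than the total (for example 0/1).
    not_ready_pods() {
        awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
    }

    # Usage against a live cluster:
    #   kubectl get pods -n sdfabric --no-headers | not_ready_pods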
Logs of the SD-Fabric pods can be accessed by using the ``kubectl logs`` command:
.. code-block::
$ kubectl -n sdfabric logs onos-tost-onos-classic-0
2021-10-12 04:46:17,955 INFO [EventAdminConfigurationNotifier] Sending Event Admin notification (configuration successful) to org/ops4j/pax/logging/Configuration
...
2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Changes to perform:
2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Region: root
2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Bundles to install:
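If a pod has restarted, the log of the current container may no longer show the original failure. The sketch
below uses the standard ``--previous`` and ``--tail`` options of ``kubectl logs`` to read the end of the log of the
previous container instance. The helper name ``previous_logs`` and the default of 200 lines are assumptions made
for illustration.

.. code-block:: shell

    # Hypothetical helper: print the last lines logged by the previous instance
    # of a pod's container, which is useful after a crash or a restart.
    previous_logs() {
        namespace="$1"
        pod="$2"
        lines="${3:-200}"
        kubectl -n "$namespace" logs "$pod" --previous --tail="$lines"
    }

    # Example: previous_logs sdfabric onos-tost-onos-classic-0 500
    # To follow the live log instead:
    #   kubectl -n sdfabric logs -f onos-tost-onos-classic-0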
ONOS Troubleshooting
--------------------
You can access the ONOS CLI by establishing an SSH connection to port ``8101`` (the default password is `karaf`):
.. code-block::
$ kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101
# In another terminal (or after sending the port-forward to the background with its output redirected to /dev/null)
$ ssh -p 8101 karaf@localhost
The authenticity of host '[localhost]:8101 ([127.0.0.1]:8101)' can't be established.
RSA key fingerprint is SHA256:Mlaax9tHmIR6WwK0B3okC1O4mpAuoXjI7Z5+KKelxOo.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[localhost]:8101' (RSA) to the list of known hosts.
Password authentication
Password:
Welcome to Open Network Operating System (ONOS)!
____ _ ______ ____
/ __ \/ |/ / __ \/ __/
/ /_/ / / /_/ /\ \
\____/_/|_/\____/___/
Documentation: wiki.onosproject.org
Tutorials: tutorials.onosproject.org
Mailing lists: lists.onosproject.org
Come help out! Find out how at: contribute.onosproject.org
Hit '<tab>' for a list of available commands
and '[cmd] --help' for help on a specific command.
Hit '<ctrl-d>' or type 'logout' to exit ONOS session.
karaf@root >
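The two steps above can be combined so that a single terminal is enough. The following is a minimal sketch: the
helper name ``onos_cli`` is hypothetical, and the two-second pause is a crude assumption about how long the
port-forward needs to become available.

.. code-block:: shell

    # Hypothetical helper: open the ONOS CLI of a pod through a temporary
    # port-forward that runs in the background.
    onos_cli() {
        namespace="$1"
        pod="$2"
        # Start the port-forward in the background and discard its output.
        kubectl -n "$namespace" port-forward "$pod" 8101 >/dev/null 2>&1 &
        pf_pid=$!
        # Assumption: two seconds is enough for the tunnel to come up.
        sleep 2
        ssh -p 8101 karaf@localhost
        # Stop the port-forward once the SSH session ends.
        kill "$pf_pid" 2>/dev/null || true
    }

    # Example: onos_cli sdfabric onos-tost-onos-classic-0

Stopping the port-forward when the session ends avoids leaving local port ``8101`` occupied for the next session.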
Alternatively, if it is not possible to establish an SSH connection with the ONOS pods,
you can use the ``kubectl exec`` command on the target pod:
.. code-block::
$ kubectl -n sdfabric exec -it onos-tost-onos-classic-0 -- bash apache-karaf-4.2.14/bin/client
Welcome to Open Network Operating System (ONOS)!
____ _ ______ ____
/ __ \/ |/ / __ \/ __/
/ /_/ / / /_/ /\ \
\____/_/|_/\____/___/
Documentation: wiki.onosproject.org
Tutorials: tutorials.onosproject.org
Mailing lists: lists.onosproject.org
Come help out! Find out how at: contribute.onosproject.org
Hit '<tab>' for a list of available commands
and '[cmd] --help' for help on a specific command.
Hit '<ctrl-d>' or type 'logout' to exit ONOS session.
karaf@root
You can attach to the ONOS logs by using the ``log:tail`` command:
.. code-block::
karaf@root > log:tail
20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine1 -> device:leaf1
20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine2 -> device:leaf1
20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf1 -> device:spine1
20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf2 -> device:spine1
The command continuously displays the log entries, which is useful for a live debugging session.
Complete ONOS logs can be accessed by using the ``kubectl logs`` command, as explained in the previous section.
If the logs do not reveal the cause, you can inspect
the ONOS state by issuing specific CLI commands. The section `Frequently Used Commands`_ lists a few commands we frequently use
when troubleshooting SD-Fabric.
Pipeline Walk-through
^^^^^^^^^^^^^^^^^^^^^
.. note::
More information on the pipeline walk-through is coming soon
onos-diagnostics
^^^^^^^^^^^^^^^^
If you cannot figure out what is going wrong, you can seek help on the SD-Fabric developer mailing list
``sdfabric-dev@opennetworking.org`` or reach out on the ``sdfabric-dev`` Slack channel. There are a few
things we would like you to attach:
- **Issue description**
- **Environment description**, such as SD-Fabric version, switch model and SDE version
- **Steps to reproduce**, as detailed as possible
- **Diagnostics**.
We have built a tool `onos-diagnostics-k8s <https://wiki.onosproject.org/display/ONOS/ONOS+Remote+Admin+Tools>`_
to help you easily collect and package ONOS diagnostics. The tool collects various information from the running
ONOS cluster and packages it into one, easy-to-share archive file. This tool is distributed as part of the ONOS
software itself (under bin directory), but is also available as part of a small archive of remote tools to administer
an ONOS cluster (`onos-admin-\*.tar.gz`).
Alternatively, it is possible to use ``onos-diagnostics-k8s`` in Kubernetes-enabled environments. The tool produces
the same results as onos-diagnostics and relies only on ``kubectl`` commands. The tool needs to know the name of
the namespace, which can be provided through the ``-s`` option. You also have to provide the names of the target
pods. To avoid having to specify these names as part of the command, you can export the ``ONOS_PODS`` environment
variable. Here is an example of how to set the variable:
.. code-block::
$ export ONOS_PODS="onos-0 onos-1 onos-2"
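Rather than typing the pod names by hand, they can be derived from the cluster. The sketch below assumes that the
ONOS pods follow the ``onos-tost-onos-classic-<index>`` naming shown earlier in this guide. The helper name
``onos_pod_names`` is hypothetical, and the pattern needs adjusting if your release uses different names.

.. code-block:: shell

    # Hypothetical helper: read the output of "kubectl get pods -o name" and
    # print the ONOS pod names on one line, separated by spaces.
    onos_pod_names() {
        # Strip the "pod/" prefix, keep only names ending in "onos-classic-<index>",
        # then join the remaining lines with spaces.
        sed -n 's|^pod/||; /onos-classic-[0-9][0-9]*$/p' | tr '\n' ' ' | sed 's/ $//'
    }

    # Usage against a live cluster:
    #   export ONOS_PODS="$(kubectl -n sdfabric get pods -o name | onos_pod_names)"

The trailing ``<index>`` in the pattern excludes the ``onos-config-loader`` pod, whose name also contains
``onos-classic``.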
The tool needs to know the Karaf home (path from the mount point). To avoid having to specify this path as part
of the command, you can export the ``KARAF_HOME`` environment variable:
.. code-block::
$ export KARAF_HOME="apache-karaf-4.2.14"
Once done, the ``onos-diagnostics-k8s`` tool can be run as follows:
.. code-block::
$ onos-diagnostics-k8s -s sdfabric
The ``-n`` option lets you name the resulting archive file, which helps to distinguish between different
cluster instances, e.g.
.. code-block::
# This will produce archive file /tmp/delta-pod-diags.tar.gz
$ onos-diagnostics-k8s -s sdfabric -n delta-pod
By default, ``onos-diagnostics-k8s`` uses ``ONOS_PROFILE`` to collect the diagnostics. You can tailor the behavior of the
command to your needs by specifying a different `profile <https://github.com/opennetworkinglab/onos/blob/master/tools/package/runtime/bin/onos-diagnostics-profile>`_.
For SD-Fabric we suggest using ``TRELLIS_PROFILE``. The resulting `/tmp/\*-diags.tar.gz` file will contain all
relevant information about the ONOS cluster.
The following is an example of a complete ``onos-diagnostics-k8s`` command:
.. code-block::
$ DIAGS_PROFILE=TRELLIS_PROFILE onos-diagnostics-k8s -k apache-karaf-4.2.14 -s sdfabric onos-tost-onos-classic-0 onos-tost-onos-classic-1 onos-tost-onos-classic-2
UP4 Troubleshooting
-------------------
.. note::
More information on UP4 troubleshooting is coming soon
Common Issues
-------------
.. note::
Here is a list of common issues.
More details on each case are coming soon
ImagePullBackOff
^^^^^^^^^^^^^^^^
ONOS pod not ready (1)
^^^^^^^^^^^^^^^^^^^^^^
ONOS pod not ready (2)
^^^^^^^^^^^^^^^^^^^^^^
ONOS pods not configured
^^^^^^^^^^^^^^^^^^^^^^^^
Packet-In not working
^^^^^^^^^^^^^^^^^^^^^
Device offline
^^^^^^^^^^^^^^
Frequently Used Commands
------------------------
This subsection introduces a few commands we frequently use when troubleshooting SD-Fabric.
ONOS
^^^^
To execute the following ONOS CLI commands:
- Set up K8s port forwarding with `kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101`
- Log in to the ONOS CLI with `ssh -p 8101 karaf@localhost`. The default password is `karaf`
ONOS basics
"""""""""""
- `flows`: List flow tables. `-s` for simplified output.
- `groups`: List group tables. `-s` for simplified output.
- `devices`: List device information. `-s` for simplified output.
- `ports`: List port information. `-e` to list enabled ports only.
- `links`: List discovered links
- `hosts`: List discovered hosts. `-s` for simplified output.
- `netcfg`: List network configuration
- `interfaces`: List interface configuration
trellis-control
"""""""""""""""
- `sr-pr-list`: List current recovery phase of each device
- `sr-device-subnets`: List device-subnet mapping
fabric-tna
""""""""""
- `slices`: List network slices
- `tcs`: List traffic classes of given slice
up4
"""
- `read-interfaces`: List all interfaces installed in the data plane
- `read-pdrs`: List all PDRs installed in the data plane
- `read-fars`: List all FARs installed in the data plane
- `read-flows`: List all UE data flows installed in the data plane
Stratum
^^^^^^^
To execute the following BF Shell commands:
- Log in to the Stratum switch with `ssh root@<switch_ip>`. The default password is `onl`
- Attach to the Stratum Docker container with `docker attach \`docker ps | grep stratum-bfrt | awk \'{print $1}\'\``
- Hit `enter` for the prompt
- Use `<Ctrl-P><Ctrl-Q>` to detach from the container. Do not use `<Ctrl-C>`, since it will terminate the process.
BF Shell
""""""""
- `pm.show`: List port configurations. `-a` to list all ports.