.. SPDX-FileCopyrightText: 2021 Open Networking Foundation <info@opennetworking.org>
.. SPDX-License-Identifier: Apache-2.0

.. _troubleshooting_guide:

Troubleshooting Guide
=====================

This section provides hints and useful commands to help you troubleshoot traffic-related problems
or Kubernetes (k8s) related issues. Keep in mind that these two types of issues are closely related, because both
the control plane software and the data plane software are containerized and deployed as Kubernetes services in SD-Fabric.
Please refer to :ref:`architecture_design` for further details.

K8s troubleshooting
-------------------

We assume that the ``kubectl`` tool has already been installed on your local machine.
The first step is to set up the proper ``kubeconfig`` file to access the k8s cluster you want to troubleshoot:

.. code-block::

    $ export KUBECONFIG=~/kubeconfig/dev-sdfabric-menlo
    $ kubectl config use-context dev-sdfabric-menlo
    Switched to context "dev-sdfabric-menlo".

You can get the list of the k8s namespaces using the ``kubectl get`` command:

.. code-block::

    $ kubectl get namespaces
    ...
    kube-node-lease   Active   68d
    kube-public       Active   68d
    kube-system       Active   68d
    security-scan     Active   68d
    sdfabric          Active   26h

This guide assumes that SD-Fabric resources are deployed under the ``sdfabric`` namespace, so make sure that this
namespace has been properly created. Other namespaces may be created as well; please check your overarching chart.

If the deployment is not successful,
a first check is to make sure there are enough available nodes in the target cluster.
You can check the available nodes through the ``kubectl get nodes`` command:

.. code-block::

    $ kubectl get nodes -o wide
    NAME      STATUS  ROLES                     AGE  VERSION  INTERNAL-IP  EXTERNAL-IP  OS-IMAGE                      KERNEL-VERSION            CONTAINER-RUNTIME
    compute1  Ready   controlplane,etcd,worker  39d  v1.18.8  10.76.28.74  <none>       Ubuntu 18.04.6 LTS            5.4.0-73-generic          docker://20.10.9
    compute2  Ready   controlplane,etcd,worker  39d  v1.18.8  10.76.28.72  <none>       Ubuntu 18.04.5 LTS            5.4.0-73-generic          docker://19.3.15
    compute3  Ready   controlplane,etcd,worker  39d  v1.18.8  10.76.28.68  <none>       Ubuntu 18.04.5 LTS            5.4.0-73-generic          docker://19.3.15
    leaf1     Ready   worker                    39d  v1.18.8  10.76.28.70  <none>       Debian GNU/Linux 9 (stretch)  4.14.49-OpenNetworkLinux  docker://19.3.15
    leaf2     Ready   worker                    39d  v1.18.8  10.76.28.71  <none>       Debian GNU/Linux 9 (stretch)  4.14.49-OpenNetworkLinux  docker://19.3.15

You should have at least ``3+N`` available nodes, where N depends on the deployed network topology. Please note that ONOS
cannot be scheduled on the network devices (these are special worker nodes), and different ONOS instances cannot share the
same worker node (the same applies to Atomix).
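
As a quick way to apply this check, the following sketch counts the ``Ready`` nodes and separates the switches from
the compute nodes. The function name ``count_ready_nodes`` is hypothetical, and the sketch assumes that a switch can be
recognized by an ``OpenNetworkLinux`` kernel version, as in the sample output above; this is a convention of the
example, not something enforced by SD-Fabric:

.. code-block:: shell

    # Hypothetical helper: reads "kubectl get nodes -o wide --no-headers" output on stdin
    # and prints how many Ready nodes are compute nodes and how many are switches.
    # Assumption: a node is a switch if its line mentions OpenNetworkLinux.
    count_ready_nodes() {
        awk '$2 == "Ready" {
                 if ($0 ~ /OpenNetworkLinux/) switches++; else computes++
             }
             END { printf "compute=%d switches=%d\n", computes, switches }'
    }

It can be fed directly from the cluster with ``kubectl get nodes -o wide --no-headers | count_ready_nodes``. The
``compute`` count is the one to compare against the three nodes needed by ONOS and Atomix.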

At a minimum, you should see some basic containers that are present in each deployment.
You can get the list of the pods by using ``kubectl get pods -n sdfabric``:

.. code-block::

    $ kubectl get pods -n sdfabric -o wide
    NAME                                                       READY  STATUS   RESTARTS  AGE    IP             NODE      NOMINATED NODE  READINESS GATES
    onos-tost-atomix-0                                         1/1    Running  0         6h31m  10.72.106.161  compute3  <none>          <none>
    onos-tost-atomix-1                                         1/1    Running  0         6h31m  10.72.111.229  compute1  <none>          <none>
    onos-tost-atomix-2                                         1/1    Running  0         6h31m  10.72.75.254   compute2  <none>          <none>
    onos-tost-onos-classic-0                                   1/1    Running  0         98m    10.72.106.133  compute3  <none>          <none>
    onos-tost-onos-classic-1                                   1/1    Running  0         6h31m  10.72.111.207  compute1  <none>          <none>
    onos-tost-onos-classic-2                                   1/1    Running  0         6h31m  10.72.75.247   compute2  <none>          <none>
    onos-tost-onos-classic-onos-config-loader-ddc9d68bb-lq97t  1/1    Running  0         6h19m  10.72.106.190  compute3  <none>          <none>
    stratum-bwlvh                                              1/1    Running  0         6h31m  10.76.28.70    leaf1     <none>          <none>
    stratum-gh842                                              1/1    Running  0         6h31m  10.76.28.71    leaf2     <none>          <none>

Three Atomix nodes and three ONOS nodes are needed for HA. ``onos-config-loader`` is equally important, because without it
ONOS cannot be properly configured. The number of Stratum pods depends on the deployed topology. If the status of the pods
is not ``Running``, you can check the events published by the k8s components to get a first idea of what is happening:

.. code-block::

    $ kubectl get events -n sdfabric --sort-by='.lastTimestamp'
    LAST SEEN  TYPE    REASON             OBJECT                          MESSAGE
    12m        Normal  Scheduled          pod/telegraf-75b959574d-sl8qb   Successfully assigned tost/telegraf-75b959574d-sl8qb to compute3
    12m        Normal  SuccessfulCreate   replicaset/telegraf-75b959574d  Created pod: telegraf-75b959574d-sl8qb
    12m        Normal  ScalingReplicaSet  deployment/telegraf             Scaled up replica set telegraf-75b959574d to 1
    12m        Normal  Pulled             pod/telegraf-75b959574d-sl8qb   Container image "telegraf:1.17" already present on machine
    12m        Normal  AddedInterface     pod/telegraf-75b959574d-sl8qb   Add eth0 [10.72.106.153/32]
    12m        Normal  Started            pod/telegraf-75b959574d-sl8qb   Started container telegraf
    12m        Normal  Created            pod/telegraf-75b959574d-sl8qb   Created container telegraf
    ...

The ``--sort-by='.lastTimestamp'`` option sorts the events by time. The previous command
reports all the events that happened in the ``sdfabric`` namespace. If you want more insight into a specific
pod, you can use the ``kubectl describe pods`` command:

.. code-block::

    $ kubectl describe pods -n sdfabric onos-tost-onos-classic-0
    Name:         onos-tost-onos-classic-0
    Namespace:    sdfabric
    Priority:     0
    Node:         compute3/10.76.28.68
    Start Time:   Mon, 11 Oct 2021 10:35:43 +0200
    ...
    Events:
      Type  Reason  Age  From  Message
      ...
      {"message":"pending"}
      org.onosproject.segmentrouting is not yet ready

The ``Events`` section typically provides useful information about the issues the pod is facing.
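
In a busy namespace most events are of type ``Normal``, so it can help to hide them. The following is a minimal
sketch of such a filter; the function name ``non_normal_events`` is hypothetical, and the sketch assumes that the
``LAST SEEN`` value is a single token such as ``12m``, so that ``TYPE`` is the second field of each data row:

.. code-block:: shell

    # Hypothetical helper: reads "kubectl get events" output on stdin and keeps
    # the header line plus every event whose TYPE column is not "Normal".
    # Assumption: LAST SEEN is a single token (e.g. "12m"), so TYPE is field 2.
    non_normal_events() {
        awk 'NR == 1 || $2 != "Normal"'
    }

It can be appended to the command shown earlier, for example
``kubectl get events -n sdfabric --sort-by='.lastTimestamp' | non_normal_events``.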

Both ONOS and Atomix define readiness probes, which make sure that the pods are ready before any configuration
takes place. As a consequence, if the probes fail for a given pod, the output of
``kubectl get pods`` shows ``0/1`` in the ``READY`` column next to its name. `ONOS pod not ready (1)`_ and
`ONOS pod not ready (2)`_ describe two scenarios frequently faced by the SD-Fabric developers.
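
To spot such pods without scanning the whole listing, the ``READY`` column can be parsed with a small filter. This is
only a sketch, and the function name ``not_ready_pods`` is hypothetical:

.. code-block:: shell

    # Hypothetical helper: reads "kubectl get pods --no-headers" output on stdin and
    # prints the name and READY value of every pod whose ready count differs from its total.
    not_ready_pods() {
        awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1, $2 }'
    }

For example, ``kubectl get pods -n sdfabric --no-headers | not_ready_pods`` prints nothing when all the readiness
probes succeed.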

Logs of the SD-Fabric pods can be accessed by using the ``kubectl logs`` command:

.. code-block::

    $ kubectl -n sdfabric logs onos-tost-onos-classic-0
    2021-10-12 04:46:17,955 INFO [EventAdminConfigurationNotifier] Sending Event Admin notification (configuration successful) to org/ops4j/pax/logging/Configuration
    ...
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Changes to perform:
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Region: root
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Bundles to install:


ONOS Troubleshooting
--------------------

You can get the ONOS CLI by establishing an SSH connection to port ``8101`` (the default password is ``karaf``):

.. code-block::

    $ kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101
    # In another terminal (or after sending the port-forward to the background with its output redirected to /dev/null)
    $ ssh -p 8101 karaf@localhost
    The authenticity of host '[localhost]:8101 ([127.0.0.1]:8101)' can't be established.
    RSA key fingerprint is SHA256:Mlaax9tHmIR6WwK0B3okC1O4mpAuoXjI7Z5+KKelxOo.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added '[localhost]:8101' (RSA) to the list of known hosts.
    Password authentication
    Password:
    Welcome to Open Network Operating System (ONOS)!
         ____  _  ______  ____
        / __ \/ |/ / __ \/ __/
       / /_/ /    / /_/ /\ \
       \____/_/|_/\____/___/

    Documentation: wiki.onosproject.org
    Tutorials:     tutorials.onosproject.org
    Mailing lists: lists.onosproject.org

    Come help out! Find out how at: contribute.onosproject.org

    Hit '<tab>' for a list of available commands
    and '[cmd] --help' for help on a specific command.
    Hit '<ctrl-d>' or type 'logout' to exit ONOS session.

    karaf@root >

Alternatively, if it is not possible to establish an SSH connection with the ONOS pods,
you can use the ``kubectl exec`` command on the target pod:

.. code-block::

    $ kubectl -n sdfabric exec -it onos-tost-onos-classic-0 -- bash apache-karaf-4.2.14/bin/client
    Welcome to Open Network Operating System (ONOS)!
         ____  _  ______  ____
        / __ \/ |/ / __ \/ __/
       / /_/ /    / /_/ /\ \
       \____/_/|_/\____/___/

    Documentation: wiki.onosproject.org
    Tutorials:     tutorials.onosproject.org
    Mailing lists: lists.onosproject.org

    Come help out! Find out how at: contribute.onosproject.org

    Hit '<tab>' for a list of available commands
    and '[cmd] --help' for help on a specific command.
    Hit '<ctrl-d>' or type 'logout' to exit ONOS session.

    karaf@root

You can attach to the ONOS logs by using the ``log:tail`` command:

.. code-block::

    karaf@root > log:tail
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine1 -> device:leaf1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine2 -> device:leaf1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf1 -> device:spine1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf2 -> device:spine1

The command continuously displays the log entries, which is useful for a live debugging session.
Complete ONOS logs can be accessed by using the ``kubectl logs`` command, as explained in the previous section.
If nothing can be figured out from the logs, you can inspect
the ONOS state by issuing specific CLI commands. The section `Frequently Used Commands`_ lists a few commands we
frequently use when troubleshooting SD-Fabric.

Pipeline Walk-through
^^^^^^^^^^^^^^^^^^^^^
.. note::
    More information on the Pipeline Walk-through is coming soon.

onos-diagnostics
^^^^^^^^^^^^^^^^

If you can't figure out what is going wrong, you can seek help on the SD-Fabric developer mailing list
``sdfabric-dev@opennetworking.org`` or reach out on the ``sdfabric-dev`` Slack channel. There are a few
things we would like you to attach:

- **Issue description**

- **Environment description**, such as SD-Fabric version, switch model and SDE version

- **Steps to reproduce**, as detailed as possible

- **Diagnostics**

We have built a tool, `onos-diagnostics <https://wiki.onosproject.org/display/ONOS/ONOS+Remote+Admin+Tools>`_,
to help you easily collect and package ONOS diagnostics. The tool collects various information from the running
ONOS cluster and packages it into one easy-to-share archive file. This tool is distributed as part of the ONOS
software itself (under the ``bin`` directory), but is also available as part of a small archive of remote tools to
administer an ONOS cluster (``onos-admin-*.tar.gz``).

Alternatively, it is possible to use ``onos-diagnostics-k8s`` in Kubernetes-enabled environments. The tool produces
the same results as ``onos-diagnostics`` and relies only on ``kubectl`` commands. The tool needs to know the name of
the namespace, which can be provided through the ``-s`` option. Then, you have to provide the names of the target
pods. To avoid having to specify these names as part of the command, you can export the ``ONOS_PODS`` environment
variable. Here is an example of how to set the variable:

.. code-block::

    $ export ONOS_PODS="onos-0 onos-1 onos-2"

The tool needs to know the Karaf home (path from the mount point). To avoid having to specify this path as part
of the command, you can export the ``KARAF_HOME`` environment variable:

.. code-block::

    $ export KARAF_HOME="apache-karaf-4.2.14"

Once done, the ``onos-diagnostics-k8s`` tool can be run as follows:

.. code-block::

    $ onos-diagnostics-k8s -s sdfabric

The ``-n`` option allows you to name the resulting archive file, which helps to tell apart different
cluster instances, e.g.

.. code-block::

    # This will produce the archive file /tmp/delta-pod-diags.tar.gz
    $ onos-diagnostics-k8s -s sdfabric -n delta-pod

By default, ``onos-diagnostics-k8s`` uses ``ONOS_PROFILE`` to collect the diagnostics. You can tailor the behavior of the
command to your needs by specifying a different `profile <https://github.com/opennetworkinglab/onos/blob/master/tools/package/runtime/bin/onos-diagnostics-profile>`_.
For SD-Fabric we suggest using ``TRELLIS_PROFILE``. The resulting ``/tmp/*-diags.tar.gz`` file will contain all
relevant information about the ONOS cluster.
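
The pod names exported in ``ONOS_PODS`` do not have to be typed by hand. The following sketch derives them from the
output of ``kubectl get pods -o name``. The function name ``onos_pod_names`` is hypothetical, and the sketch assumes
that the controller pods are named ``<release>-onos-classic-<index>``, as in the listings of this guide:

.. code-block:: shell

    # Hypothetical helper: reads "kubectl get pods -o name" output on stdin and prints,
    # on a single line, the names of the ONOS controller pods separated by spaces.
    # Assumption: controller pods are named "<release>-onos-classic-<index>".
    onos_pod_names() {
        sed -n 's|^pod/\(.*-onos-classic-[0-9][0-9]*\)$|\1|p' | tr '\n' ' ' | sed 's/ $//'
    }

With this helper, the variable can be set with
``export ONOS_PODS="$(kubectl -n sdfabric get pods -o name | onos_pod_names)"``. Pods such as
``onos-config-loader`` are skipped because their names do not end with a numeric index.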

The following is an example of a complete ``onos-diagnostics-k8s`` command:

.. code-block::

    $ DIAGS_PROFILE=TRELLIS_PROFILE onos-diagnostics-k8s -k apache-karaf-4.2.14 -s sdfabric onos-tost-onos-classic-0 onos-tost-onos-classic-1 onos-tost-onos-classic-2

UP4 Troubleshooting
-------------------

.. note::
    More information on UP4 troubleshooting is coming soon.

Common Issues
-------------

.. note::
    Here is a list of common issues.
    More details on each case are coming soon.

ImagePullBackOff
^^^^^^^^^^^^^^^^

ONOS pod not ready (1)
^^^^^^^^^^^^^^^^^^^^^^

ONOS pod not ready (2)
^^^^^^^^^^^^^^^^^^^^^^

ONOS pods not configured
^^^^^^^^^^^^^^^^^^^^^^^^

Packet-In not working
^^^^^^^^^^^^^^^^^^^^^

Device offline
^^^^^^^^^^^^^^

Frequently Used Commands
------------------------

This section introduces a few commands we frequently use when troubleshooting SD-Fabric.

ONOS
^^^^
To execute the following ONOS CLI commands:

- Create a K8s port forwarding with ``kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101``
- Log in to the ONOS CLI with ``ssh -p 8101 karaf@localhost``. The default password is ``karaf``.

ONOS basics
"""""""""""
- ``flows``: List flow tables. ``-s`` for simplified output.
- ``groups``: List group tables. ``-s`` for simplified output.
- ``devices``: List device information. ``-s`` for simplified output.
- ``ports``: List port information. ``-e`` to list enabled ports only.
- ``links``: List discovered links
- ``hosts``: List discovered hosts. ``-s`` for simplified output.
- ``netcfg``: List network configuration
- ``interfaces``: List interface configuration

trellis-control
"""""""""""""""
- ``sr-pr-list``: List current recovery phase of each device
- ``sr-device-subnets``: List device-subnet mapping

fabric-tna
""""""""""
- ``slices``: List network slices
- ``tcs``: List traffic classes of a given slice

up4
"""
- ``read-entities -a``: Print UPF entities installed in the UPF dataplane.
  More options are available. See ``read-entities --help``


Stratum
^^^^^^^
To execute the following BF Shell commands:

- Log in to the Stratum switch with ``ssh root@<switch_ip>``. The default password is ``onl``.
- Attach to the Stratum docker container with ``docker attach $(docker ps | grep stratum-bfrt | awk '{print $1}')``

  - Hit ``enter`` to get the prompt
  - Use ``<Ctrl-P><Ctrl-Q>`` to detach from the container. Do not use ``<Ctrl-C>``, since it will terminate the process.

BF Shell
""""""""
- ``pm.show``: List port configurations. ``-a`` to list all ports.
Charles Chan0bb84212022-02-28 17:17:43 -0800356- ``pm.show``: List port configurations. `-a` to list all ports.