.. SPDX-FileCopyrightText: 2021 Open Networking Foundation <info@opennetworking.org>
.. SPDX-License-Identifier: Apache-2.0

.. _troubleshooting_guide:

Troubleshooting Guide
=====================

This section provides hints and useful commands to help you troubleshoot traffic-related problems
or k8s-related issues. Keep in mind that these two types of issues are closely related, because both
control plane software and data plane software are containerized and deployed as Kubernetes services in SD-Fabric.
Please refer to :ref:`architecture_design` for further details.

ONL troubleshooting
-------------------

Can't reboot into ONL, loops on ONIE installer mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sometimes an ONL installation is incomplete or problematic, and reinstalling it
doesn't result in a working system.

If this is the case, reboot into ONIE Rescue mode and use ``parted`` to delete
all the ``ONL-`` prefixed partitions, then reinstall with an ``onie-installer``
image.

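The partition cleanup can be sketched as follows. This is only a sketch under stated assumptions, not a verified
procedure: it assumes the ONL disk is ``/dev/sda`` (check this first) and relies on the machine-readable output of
``parted -m``, where the partition name is the sixth colon-separated field. The helper name
``onl_partition_numbers`` is hypothetical.

.. code-block:: shell

    # Print the numbers of all partitions whose name starts with "ONL-".
    # Reads `parted -m <disk> print` output on stdin
    # (fields: number:start:end:size:filesystem:name:flags;).
    onl_partition_numbers() {
        awk -F: '$1 ~ /^[0-9]+$/ && $6 ~ /^ONL-/ { print $1 }'
    }

    # Usage (destructive, double-check the disk name before running):
    #   DISK=/dev/sda    # assumption: adjust to your switch
    #   for n in $(parted -m "$DISK" print | onl_partition_numbers); do
    #       parted -s "$DISK" rm "$n"
    #   done
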
K8s troubleshooting
-------------------

We assume that ``kubectl`` has already been installed on your local machine.
The first step is to set up the proper ``kubeconfig`` file to access the k8s cluster you want to troubleshoot:

.. code-block::

    $ export KUBECONFIG=~/kubeconfig/dev-sdfabric-menlo
    $ kubectl config use-context dev-sdfabric-menlo
    Switched to context "dev-sdfabric-menlo".

You can get the list of the k8s namespaces using the ``kubectl get namespaces`` command:

.. code-block::

    $ kubectl get namespaces
    ...
    kube-node-lease   Active   68d
    kube-public       Active   68d
    kube-system       Active   68d
    security-scan     Active   68d
    sdfabric          Active   26h

Let's assume that SD-Fabric resources are deployed under the ``sdfabric`` namespace, so make sure that this
namespace has been properly created. Other namespaces may be created as well; please check your overarching chart.

If the deployment is not successful,
first make sure there are enough available nodes in the target cluster.
You can check the available nodes with the ``kubectl get nodes`` command:

.. code-block::

    $ kubectl get nodes -o wide
    NAME       STATUS   ROLES                      AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION             CONTAINER-RUNTIME
    compute1   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.74   <none>        Ubuntu 18.04.6 LTS             5.4.0-73-generic           docker://20.10.9
    compute2   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.72   <none>        Ubuntu 18.04.5 LTS             5.4.0-73-generic           docker://19.3.15
    compute3   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.68   <none>        Ubuntu 18.04.5 LTS             5.4.0-73-generic           docker://19.3.15
    leaf1      Ready    worker                     39d   v1.18.8   10.76.28.70   <none>        Debian GNU/Linux 9 (stretch)   4.14.49-OpenNetworkLinux   docker://19.3.15
    leaf2      Ready    worker                     39d   v1.18.8   10.76.28.71   <none>        Debian GNU/Linux 9 (stretch)   4.14.49-OpenNetworkLinux   docker://19.3.15

You should have at least `3+N` available nodes, where N depends on the deployed network topology. Please note that ONOS
cannot be scheduled on the network devices (these are special worker nodes), and different ONOS instances cannot share
the same worker node (the same applies to Atomix).

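The node count check can be scripted along these lines. This is a minimal sketch: the function names are
hypothetical, and N is taken here to be the number of switch nodes, which is an assumption based on the example
output above (three compute nodes plus two leaves).

.. code-block:: shell

    # Count the nodes whose STATUS column is exactly "Ready".
    count_ready_nodes() {
        kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }'
    }

    # Succeed only if at least 3 + N nodes are Ready.
    # $1 = N (assumed here to be the number of switch nodes)
    check_node_count() {
        [ "$(count_ready_nodes)" -ge $((3 + $1)) ]
    }

    # Usage:
    #   check_node_count 2 || echo "not enough Ready nodes"
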
Some basic containers must be present in every deployment.
You can get the list of the pods by using ``kubectl get pods -n sdfabric``:

.. code-block::

    $ kubectl get pods -n sdfabric -o wide
    NAME                                                        READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES
    onos-tost-atomix-0                                          1/1     Running   0          6h31m   10.72.106.161   compute3   <none>           <none>
    onos-tost-atomix-1                                          1/1     Running   0          6h31m   10.72.111.229   compute1   <none>           <none>
    onos-tost-atomix-2                                          1/1     Running   0          6h31m   10.72.75.254    compute2   <none>           <none>
    onos-tost-onos-classic-0                                    1/1     Running   0          98m     10.72.106.133   compute3   <none>           <none>
    onos-tost-onos-classic-1                                    1/1     Running   0          6h31m   10.72.111.207   compute1   <none>           <none>
    onos-tost-onos-classic-2                                    1/1     Running   0          6h31m   10.72.75.247    compute2   <none>           <none>
    onos-tost-onos-classic-onos-config-loader-ddc9d68bb-lq97t   1/1     Running   0          6h19m   10.72.106.190   compute3   <none>           <none>
    stratum-bwlvh                                               1/1     Running   0          6h31m   10.76.28.70     leaf1      <none>           <none>
    stratum-gh842                                               1/1     Running   0          6h31m   10.76.28.71     leaf2      <none>           <none>

Three Atomix nodes and three ONOS nodes are needed for HA. `onos-config-loader` is equally important, because without
it ONOS cannot be properly configured. The number of Stratum pods depends on the deployed topology. If the status of
a pod is not `Running`, you can check the events published by the k8s components to get a first idea of what is happening:

.. code-block::

    $ kubectl get events -n sdfabric --sort-by='.lastTimestamp'
    LAST SEEN   TYPE     REASON              OBJECT                           MESSAGE
    12m         Normal   Scheduled           pod/telegraf-75b959574d-sl8qb    Successfully assigned tost/telegraf-75b959574d-sl8qb to compute3
    12m         Normal   SuccessfulCreate    replicaset/telegraf-75b959574d   Created pod: telegraf-75b959574d-sl8qb
    12m         Normal   ScalingReplicaSet   deployment/telegraf              Scaled up replica set telegraf-75b959574d to 1
    12m         Normal   Pulled              pod/telegraf-75b959574d-sl8qb    Container image "telegraf:1.17" already present on machine
    12m         Normal   AddedInterface      pod/telegraf-75b959574d-sl8qb    Add eth0 [10.72.106.153/32]
    12m         Normal   Started             pod/telegraf-75b959574d-sl8qb    Started container telegraf
    12m         Normal   Created             pod/telegraf-75b959574d-sl8qb    Created container telegraf
    ...

The ``--sort-by='.lastTimestamp'`` option sorts the events by time. The ``kubectl get events`` command above
reports all the events that happened in the ``sdfabric`` namespace. For more insight into a specific
pod, you can use the ``kubectl describe pods`` command:

.. code-block::

    $ kubectl describe pods -n sdfabric onos-tost-onos-classic-0
    Name:         onos-tost-onos-classic-0
    Namespace:    sdfabric
    Priority:     0
    Node:         compute3/10.76.28.68
    Start Time:   Mon, 11 Oct 2021 10:35:43 +0200
    ...
    Events:
      Type    Reason    Age    From    Message
      ...
      {"message":"pending"}
      org.onosproject.segmentrouting is not yet ready

The ``Events`` section typically provides useful information about the issues the pod is facing.

Both ONOS and Atomix define readiness probes, which make sure that the pods are ready before any configuration
takes place. As a consequence, if the probes fail for a given pod, the output of
``kubectl get pods`` shows ``0/1`` in the ``READY`` column next to its name. `ONOS pod not ready (1)`_ and
`ONOS pod not ready (2)`_ describe two scenarios frequently faced by the SD-Fabric developers.

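To spot pods with failing readiness probes without reading the whole table, a helper along these lines may be
handy. It is a minimal sketch with hypothetical function names. It simply compares the two numbers of the
``READY`` column, so pods that have completed on purpose would also be listed.

.. code-block:: shell

    # Print the names of pods whose READY column (e.g. "0/1") shows fewer
    # ready containers than expected.
    # $1 = namespace
    list_unready_pods() {
        kubectl get pods -n "$1" --no-headers |
            awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
    }

    # Block until the given pod reports the Ready condition, or time out.
    # $1 = namespace, $2 = pod name
    wait_for_pod() {
        kubectl wait --for=condition=Ready "pod/$2" -n "$1" --timeout=120s
    }

    # Usage:
    #   list_unready_pods sdfabric
    #   wait_for_pod sdfabric onos-tost-onos-classic-0
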
Logs of the SD-Fabric pods can be accessed with the ``kubectl logs`` command:

.. code-block::

    $ kubectl -n sdfabric logs onos-tost-onos-classic-0
    2021-10-12 04:46:17,955 INFO [EventAdminConfigurationNotifier] Sending Event Admin notification (configuration successful) to org/ops4j/pax/logging/Configuration
    ...
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Changes to perform:
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Region: root
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Bundles to install:


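Full logs can be long, so it may help to filter them. The following is a minimal sketch with a hypothetical
function name. It assumes the log level appears as a separate word on each line, as in the sample output above.

.. code-block:: shell

    # Show only WARN and ERROR entries from the most recent log lines of a pod.
    # $1 = namespace, $2 = pod name, $3 = number of lines to inspect (default 1000)
    onos_log_problems() {
        kubectl -n "$1" logs "$2" --tail="${3:-1000}" | grep -E ' (WARN|ERROR) '
    }

    # Usage:
    #   onos_log_problems sdfabric onos-tost-onos-classic-0
    # If the container restarted, the logs of the previous instance are
    # available with the --previous flag:
    #   kubectl -n sdfabric logs onos-tost-onos-classic-0 --previous
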
ONOS Troubleshooting
--------------------

You can get the ONOS CLI by establishing an SSH connection to port ``8101`` (the default password is `karaf`):

.. code-block::

    $ kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101
    # In another terminal (or run the port-forward in the background and redirect its output to /dev/null)
    $ ssh -p 8101 karaf@localhost
    The authenticity of host '[localhost]:8101 ([127.0.0.1]:8101)' can't be established.
    RSA key fingerprint is SHA256:Mlaax9tHmIR6WwK0B3okC1O4mpAuoXjI7Z5+KKelxOo.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added '[localhost]:8101' (RSA) to the list of known hosts.
    Password authentication
    Password:
    Welcome to Open Network Operating System (ONOS)!
         ____  _  ______  ____
        / __ \/ |/ / __ \/ __/
       / /_/ /    / /_/ /\ \
       \____/_/|_/\____/___/

    Documentation: wiki.onosproject.org
    Tutorials:     tutorials.onosproject.org
    Mailing lists: lists.onosproject.org

    Come help out! Find out how at: contribute.onosproject.org

    Hit '<tab>' for a list of available commands
    and '[cmd] --help' for help on a specific command.
    Hit '<ctrl-d>' or type 'logout' to exit ONOS session.

    karaf@root >

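The port-forward and the SSH session can be combined in a single helper. The following is a minimal sketch with a
hypothetical function name. The two-second sleep is a crude assumption about how long the tunnel takes to come
up, and SSH will still prompt for the password.

.. code-block:: shell

    # Run one ONOS CLI command (or an interactive session if none is given)
    # through a temporary port-forward.
    # $1 = namespace, $2 = pod name, remaining arguments = ONOS CLI command
    onos_cli() {
        ns="$1"; pod="$2"; shift 2
        # Start the port-forward in the background and discard its output.
        kubectl -n "$ns" port-forward "$pod" 8101 >/dev/null 2>&1 &
        pf_pid=$!
        sleep 2    # crude wait for the tunnel to come up
        ssh -p 8101 karaf@localhost "$@"
        status=$?
        # Tear the port-forward down again.
        kill "$pf_pid" 2>/dev/null
        return "$status"
    }

    # Usage:
    #   onos_cli sdfabric onos-tost-onos-classic-0 flows -s
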
Alternatively, if it is not possible to establish an SSH connection with the ONOS pods,
you can use the ``kubectl exec`` command on the target pod:

.. code-block::

    $ kubectl -n sdfabric exec -it onos-tost-onos-classic-0 -- bash apache-karaf-4.2.14/bin/client
    Welcome to Open Network Operating System (ONOS)!
         ____  _  ______  ____
        / __ \/ |/ / __ \/ __/
       / /_/ /    / /_/ /\ \
       \____/_/|_/\____/___/

    Documentation: wiki.onosproject.org
    Tutorials:     tutorials.onosproject.org
    Mailing lists: lists.onosproject.org

    Come help out! Find out how at: contribute.onosproject.org

    Hit '<tab>' for a list of available commands
    and '[cmd] --help' for help on a specific command.
    Hit '<ctrl-d>' or type 'logout' to exit ONOS session.

    karaf@root

You can attach to the ONOS logs by using the ``log:tail`` command:

.. code-block::

    karaf@root > log:tail
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine1 -> device:leaf1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine2 -> device:leaf1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf1 -> device:spine1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf2 -> device:spine1

The command continuously displays the log entries, which is useful for a live debugging session.
Complete ONOS logs can be accessed with the ``kubectl logs`` command, as explained in the previous section.
If nothing can be figured out from the logs, you can inspect
the ONOS state by issuing specific CLI commands. The section `Frequently Used Commands`_ lists a few commands we
frequently use when troubleshooting SD-Fabric.

Pipeline Walk-through
^^^^^^^^^^^^^^^^^^^^^

.. note::
    More information about the pipeline walk-through is coming soon.

onos-diagnostics
^^^^^^^^^^^^^^^^

If you can't figure out what is going wrong, you can seek help on the SD-Fabric developer mailing list
``sdfabric-dev@opennetworking.org`` or reach out on the ``sdfabric-dev`` Slack channel. There are a few
things we would like you to attach:

- **Issue description**

- **Environment description**, such as SD-Fabric version, switch model and SDE version

- **Steps to reproduce**, as detailed as possible

- **Diagnostics**.

We have built a tool, `onos-diagnostics-k8s <https://wiki.onosproject.org/display/ONOS/ONOS+Remote+Admin+Tools>`_,
to help you easily collect and package ONOS diagnostics. The tool collects various information from the running
ONOS cluster and packages it into one easy-to-share archive file. This tool is distributed as part of the ONOS
software itself (under the ``bin`` directory), but is also available as part of a small archive of remote tools to
administer an ONOS cluster (`onos-admin-\*.tar.gz`).

``onos-diagnostics-k8s`` is meant for Kubernetes-enabled environments: it produces
the same results as ``onos-diagnostics`` and relies only on ``kubectl`` commands. The tool needs to know the name of
the namespace, which can be provided through the ``-s`` option. You then have to provide the names of the target
pods. To avoid having to specify these names as part of the command, you can export the ``ONOS_PODS`` environment
variable. Here's an example of how to set the variable:

.. code-block::

    $ export ONOS_PODS="onos-0 onos-1 onos-2"

The tool needs to know the Karaf home (path from the mount point). To avoid having to specify this path as part
of the command, you can export the ``KARAF_HOME`` environment variable:

.. code-block::

    $ export KARAF_HOME="apache-karaf-4.2.14"

Once done, the ``onos-diagnostics-k8s`` tool can be run as follows:

.. code-block::

    $ onos-diagnostics-k8s -s sdfabric

The ``-n`` option lets you name the resulting archive file, which helps to tell different
cluster instances apart, e.g.:

.. code-block::

    # This will produce archive file /tmp/delta-pod-diags.tar.gz
    $ onos-diagnostics-k8s -s sdfabric -n delta-pod

By default, ``onos-diagnostics-k8s`` uses ``ONOS_PROFILE`` to collect the diagnostics. You can tailor the behavior of the
command to your needs by specifying a different `profile <https://github.com/opennetworkinglab/onos/blob/master/tools/package/runtime/bin/onos-diagnostics-profile>`_.
For SD-Fabric we suggest using ``TRELLIS_PROFILE``. The resulting `/tmp/\*-diags.tar.gz` file will contain all
relevant information about the ONOS cluster.

The following is an example of a complete ``onos-diagnostics-k8s`` command:

.. code-block::

    $ DIAGS_PROFILE=TRELLIS_PROFILE onos-diagnostics-k8s -k apache-karaf-4.2.14 -s sdfabric onos-tost-onos-classic-0 onos-tost-onos-classic-1 onos-tost-onos-classic-2

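Instead of typing the pod names by hand, ``ONOS_PODS`` can be derived from the cluster. The following is a minimal
sketch with a hypothetical function name. It assumes the ONOS pods follow the ``<release>-onos-classic-<index>``
naming seen in the ``kubectl get pods`` output earlier in this guide.

.. code-block:: shell

    # Print the ONOS pod names of a namespace on one line, separated by spaces.
    # Assumption: ONOS pods are named "<release>-onos-classic-<index>".
    # $1 = namespace
    onos_pod_names() {
        kubectl -n "$1" get pods -o name |
            sed 's|^pod/||' |
            grep -E -- '-onos-classic-[0-9]+$' |
            paste -sd ' ' -
    }

    # Usage:
    #   export ONOS_PODS="$(onos_pod_names sdfabric)"
    #   export KARAF_HOME="apache-karaf-4.2.14"
    #   DIAGS_PROFILE=TRELLIS_PROFILE onos-diagnostics-k8s -s sdfabric
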
UP4 Troubleshooting
-------------------

.. note::
    More information about UP4 troubleshooting is coming soon.

Common Issues
-------------

.. note::
    Here is a list of common issues.
    More details of each case are coming soon.

ImagePullBackOff
^^^^^^^^^^^^^^^^

ONOS pod not ready (1)
^^^^^^^^^^^^^^^^^^^^^^

ONOS pod not ready (2)
^^^^^^^^^^^^^^^^^^^^^^

ONOS pods not configured
^^^^^^^^^^^^^^^^^^^^^^^^

Packet-In not working
^^^^^^^^^^^^^^^^^^^^^

Device offline
^^^^^^^^^^^^^^

Frequently Used Commands
------------------------

In this subsection, we introduce a few commands we frequently use when troubleshooting SD-Fabric.

ONOS
^^^^

To execute the following ONOS CLI commands:

- Create K8s port forwarding with ``kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101``
- Log in to the ONOS CLI with ``ssh -p 8101 karaf@localhost``. The default password is `karaf`

ONOS basics
"""""""""""

- ``flows``: List flow tables. `-s` for simplified output.
- ``groups``: List group tables. `-s` for simplified output.
- ``devices``: List device information. `-s` for simplified output.
- ``ports``: List port information. `-e` to list enabled ports only.
- ``links``: List discovered links
- ``hosts``: List discovered hosts. `-s` for simplified output.
- ``netcfg``: List network configuration
- ``interfaces``: List interface configuration

trellis-control
"""""""""""""""

- ``sr-pr-list``: List current recovery phase of each device
- ``sr-device-subnets``: List device-subnet mapping

fabric-tna
""""""""""

- ``slices``: List network slices
- ``tcs``: List traffic classes of given slice

up4
"""

- ``read-entities -a``: Print UPF entities installed in the UPF dataplane.
  More options are available. See ``read-entities --help``


Stratum
^^^^^^^

To execute the following BF Shell commands:

- Log in to the Stratum switch with ``ssh root@<switch_ip>``. The default password is `onl`
- Attach to the Stratum docker container with ``docker attach $(docker ps | grep stratum-bfrt | awk '{print $1}')``

  - Hit `enter` for the prompt
  - Use `<Ctrl-P><Ctrl-Q>` to exit the container. Do not use `<Ctrl-C>` since it will terminate the process.

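If more than one line of ``docker ps`` matches, or none does, the one-line attach command fails in a confusing
way. A slightly more defensive variant could look like this (a minimal sketch, with a hypothetical function name):

.. code-block:: shell

    # Print the ID of the first running container whose `docker ps` line
    # mentions "stratum-bfrt"; return non-zero if there is none.
    stratum_container_id() {
        id=$(docker ps | grep stratum-bfrt | awk '{ print $1 }' | head -n 1)
        [ -n "$id" ] && echo "$id"
    }

    # Usage:
    #   docker attach "$(stratum_container_id)"
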
BF Shell
""""""""

- ``pm.show``: List port configurations. `-a` to list all ports.
Charles Chan0bb84212022-02-28 17:17:43 -0800369- ``pm.show``: List port configurations. `-a` to list all ports.