.. SPDX-FileCopyrightText: 2021 Open Networking Foundation <info@opennetworking.org>
.. SPDX-License-Identifier: Apache-2.0

.. _troubleshooting_guide:

Troubleshooting Guide
=====================

This section provides hints and useful commands to help you troubleshoot traffic-related problems
or k8s-related issues. Keep in mind that these two types of issues are closely related, because both the
control plane software and the data plane software are containerized and deployed as Kubernetes services in SD-Fabric.
Please refer to :ref:`architecture_design` for further details.

SONiC troubleshooting
---------------------

Can't reboot into SONiC, loops on ONIE installer mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sometimes a SONiC installation is incomplete or problematic, and reinstalling it
doesn't result in a working system.

If this is the case, reboot into ONIE Rescue mode and use ``parted`` to delete
all the ``SONiC`` related partitions, then reinstall the SONiC image.

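As a rough sketch of the cleanup step, the helper below filters the machine-readable output of ``parted``
for partitions whose name mentions SONiC. It assumes a GPT disk, a ``parted`` build in the rescue shell that
supports the ``-m`` (machine-parseable) option, and an available ``awk``. The function name
``sonic_partition_numbers`` and the disk ``/dev/sda`` in the usage note are illustrative and not part of SD-Fabric.

```shell
# Hypothetical helper: read `parted -m <disk> print` output on stdin and
# print the numbers of the partitions whose name contains "SONiC".
# Partition lines have the form number:start:end:size:fs:name:flags;
sonic_partition_numbers() {
  awk -F: '$1 ~ /^[0-9]+$/ && $6 ~ /SONiC/ { print $1 }'
}
```

For example, ``parted -m /dev/sda print | sonic_partition_numbers`` lists the candidate partition numbers. After
checking them against the plain ``parted /dev/sda print`` output, each one can be removed with
``parted /dev/sda rm <number>``.
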
K8s troubleshooting
-------------------

We assume that ``kubectl`` has already been installed on your local machine.
The first step is to set up the proper ``kubeconfig`` file to access the k8s cluster you want to troubleshoot:

.. code-block::

    $ export KUBECONFIG=~/kubeconfig/dev-sdfabric-menlo
    $ kubectl config use-context dev-sdfabric-menlo
    Switched to context "dev-sdfabric-menlo".

You can get the list of the k8s namespaces using the ``kubectl get`` command:

.. code-block::

    $ kubectl get namespaces
    ...
    kube-node-lease   Active   68d
    kube-public       Active   68d
    kube-system       Active   68d
    security-scan     Active   68d
    sdfabric          Active   26h

Let's assume that SD-Fabric resources are deployed under the ``sdfabric`` namespace, so make sure that the ``sdfabric``
namespace has been properly created (other namespaces may be created as well; please check your overarching chart).

If the deployment is not successful,
a first check is to make sure there are enough available nodes in the target cluster.
You can check the available nodes through the ``kubectl get nodes`` command:

.. code-block::

    $ kubectl get nodes -o wide
    NAME       STATUS   ROLES                      AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION      CONTAINER-RUNTIME
    compute1   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.74   <none>        Ubuntu 18.04.6 LTS             5.4.0-73-generic    docker://20.10.9
    compute2   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.72   <none>        Ubuntu 18.04.5 LTS             5.4.0-73-generic    docker://19.3.15
    compute3   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.68   <none>        Ubuntu 18.04.5 LTS             5.4.0-73-generic    docker://19.3.15
    leaf1      Ready    worker                     39d   v1.18.8   10.76.28.70   <none>        Debian GNU/Linux 10 (buster)   4.19.0-12-2-amd64   docker://18.9.8
    leaf2      Ready    worker                     39d   v1.18.8   10.76.28.71   <none>        Debian GNU/Linux 10 (buster)   4.19.0-12-2-amd64   docker://18.9.8

You should have at least `3+N` available nodes, where N depends on the deployed network topology. Please note that ONOS
cannot be scheduled on the network devices (these are special worker nodes), and different ONOS instances cannot share
the same worker node (the same applies to Atomix).

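The node-count check can be scripted. The following is a minimal sketch, assuming that a node counts as available
only when its ``STATUS`` column reads exactly ``Ready``. The function names ``count_ready_nodes`` and ``enough_nodes``
are hypothetical, and the topology-dependent N must be supplied by the caller.

```shell
# Hypothetical helper: count the nodes whose STATUS column is exactly "Ready".
count_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Hypothetical helper: succeed only if at least 3 + N nodes are Ready.
# $1 = N, the topology-dependent number of additional nodes.
enough_nodes() {
  required=$((3 + $1))
  ready="$(count_ready_nodes)"
  echo "Ready nodes: ${ready}, required: ${required}"
  [ "${ready}" -ge "${required}" ]
}
```

For instance, ``enough_nodes 2 || echo "not enough nodes"`` prints a warning when fewer than five nodes are ready.
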
At a minimum, every deployment includes some basic containers.
You can get the list of the pods by using ``kubectl get pods -n sdfabric``:

.. code-block::

    $ kubectl get pods -n sdfabric -o wide
    NAME                                                        READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES
    onos-tost-atomix-0                                          1/1     Running   0          6h31m   10.72.106.161   compute3   <none>           <none>
    onos-tost-atomix-1                                          1/1     Running   0          6h31m   10.72.111.229   compute1   <none>           <none>
    onos-tost-atomix-2                                          1/1     Running   0          6h31m   10.72.75.254    compute2   <none>           <none>
    onos-tost-onos-classic-0                                    1/1     Running   0          98m     10.72.106.133   compute3   <none>           <none>
    onos-tost-onos-classic-1                                    1/1     Running   0          6h31m   10.72.111.207   compute1   <none>           <none>
    onos-tost-onos-classic-2                                    1/1     Running   0          6h31m   10.72.75.247    compute2   <none>           <none>
    onos-tost-onos-classic-onos-config-loader-ddc9d68bb-lq97t   1/1     Running   0          6h19m   10.72.106.190   compute3   <none>           <none>
    stratum-bwlvh                                               1/1     Running   0          6h31m   10.76.28.70     leaf1      <none>           <none>
    stratum-gh842                                               1/1     Running   0          6h31m   10.76.28.71     leaf2      <none>           <none>

Three Atomix nodes and three ONOS nodes are needed for HA. `onos-config-loader` is equally important, because without it
ONOS cannot be properly configured. The number of Stratum pods depends on the deployed topology. If the status of a pod
is not `Running`, you can check the events published by the k8s components to get a first idea of what is happening:

.. code-block::

    $ kubectl get events -n sdfabric --sort-by='.lastTimestamp'
    LAST SEEN   TYPE     REASON              OBJECT                           MESSAGE
    12m         Normal   Scheduled           pod/telegraf-75b959574d-sl8qb    Successfully assigned tost/telegraf-75b959574d-sl8qb to compute3
    12m         Normal   SuccessfulCreate    replicaset/telegraf-75b959574d   Created pod: telegraf-75b959574d-sl8qb
    12m         Normal   ScalingReplicaSet   deployment/telegraf              Scaled up replica set telegraf-75b959574d to 1
    12m         Normal   Pulled              pod/telegraf-75b959574d-sl8qb    Container image "telegraf:1.17" already present on machine
    12m         Normal   AddedInterface      pod/telegraf-75b959574d-sl8qb    Add eth0 [10.72.106.153/32]
    12m         Normal   Started             pod/telegraf-75b959574d-sl8qb    Started container telegraf
    12m         Normal   Created             pod/telegraf-75b959574d-sl8qb    Created container telegraf
    ...

The option ``--sort-by='.lastTimestamp'`` sorts the events by time. The previous command
reports all the events that happened in the ``sdfabric`` namespace. If you want more insight into a specific
pod, you can use the command ``kubectl describe pods``:

.. code-block::

    $ kubectl describe pods -n sdfabric onos-tost-onos-classic-0
    Name:         onos-tost-onos-classic-0
    Namespace:    sdfabric
    Priority:     0
    Node:         compute3/10.76.28.68
    Start Time:   Mon, 11 Oct 2021 10:35:43 +0200
    ...
    Events:
      Type     Reason     Age     From     Message
      ...
      {"message":"pending"}
      org.onosproject.segmentrouting is not yet ready

The ``Events`` section typically provides useful information about the issues the pod is facing.

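When the namespace is busy, it can help to narrow the event list down to a single pod. The sketch below
combines the ``--sort-by`` option shown above with a field selector on ``involvedObject.name``. The function name
``pod_events`` is hypothetical.

```shell
# Hypothetical helper: show only the events that refer to one object, sorted by time.
pod_events() {
  # $1 = namespace, $2 = pod name
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2" \
    --sort-by='.lastTimestamp'
}
```

A call such as ``pod_events sdfabric onos-tost-onos-classic-0`` prints the same columns as the namespace-wide
command, restricted to that pod.
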
Both ONOS and Atomix define readiness probes, which make sure that the pods are ready before any configuration
takes place. As a consequence, if the probes fail for a given pod, the output of
``kubectl get pods`` shows ``0/1`` under the ``READY`` column next to its name. We report in `ONOS pod not ready (1)`_ and
`ONOS pod not ready (2)`_ two scenarios frequently faced by the SD-Fabric developers.

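To spot such pods quickly, the ``READY`` and ``STATUS`` columns can be filtered with ``awk``. This is only a sketch:
the function name ``not_ready_pods`` is hypothetical, and pods that finished on purpose (for example completed jobs)
would be listed as well.

```shell
# Hypothetical helper: print the pods of a namespace that are not Running
# or whose READY column shows fewer ready containers than expected.
not_ready_pods() {
  # $1 = namespace
  kubectl get pods -n "$1" --no-headers | awk '
    {
      split($2, ready, "/")    # READY column, e.g. "0/1"
      if ($3 != "Running" || ready[1] != ready[2])
        print $1, $2, $3
    }'
}
```

Running ``not_ready_pods sdfabric`` prints one line per suspicious pod with its name, ``READY`` value and status;
an empty output means every pod is running and ready.
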
Logs of the SD-Fabric pods can be accessed by using the ``kubectl logs`` command:

.. code-block::

    $ kubectl -n sdfabric logs onos-tost-onos-classic-0
    2021-10-12 04:46:17,955 INFO [EventAdminConfigurationNotifier] Sending Event Admin notification (configuration successful) to org/ops4j/pax/logging/Configuration
    ...
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Changes to perform:
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Region: root
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Bundles to install:


144--------------------
145
146You can get the ONOS CLI by establishing SSH connection to the port ``8101`` (default password is `karaf`):
147
148.. code-block::
149
pierventre16cc8022021-10-14 10:34:57 +0200150 $ kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101
pierventre517cd532021-10-12 22:58:00 +0200151 // In another terminal or you can send to /dev/null the port-forward
152 $ ssh -p 8101 karaf@localhost
153 The authenticity of host '[localhost]:8101 ([127.0.0.1]:8101)' can't be established.
154 RSA key fingerprint is SHA256:Mlaax9tHmIR6WwK0B3okC1O4mpAuoXjI7Z5+KKelxOo.
155 Are you sure you want to continue connecting (yes/no)? yes
156 Warning: Permanently added '[localhost]:8101' (RSA) to the list of known hosts.
157 Password authentication
158 Password:
159 Welcome to Open Network Operating System (ONOS)!
160 ____ _ ______ ____
161 / __ \/ |/ / __ \/ __/
162 / /_/ / / /_/ /\ \
163 \____/_/|_/\____/___/
164
165 Documentation: wiki.onosproject.org
166 Tutorials: tutorials.onosproject.org
167 Mailing lists: lists.onosproject.org
168
169 Come help out! Find out how at: contribute.onosproject.org
170
171 Hit '<tab>' for a list of available commands
172 and '[cmd] --help' for help on a specific command.
173 Hit '<ctrl-d>' or type 'logout' to exit ONOS session.
174
175 karaf@root >
176
Alternatively, if it is not possible to establish an SSH connection with the ONOS pods,
you can use the ``kubectl exec`` command on the target pod:

.. code-block::

    $ kubectl -n sdfabric exec -it onos-tost-onos-classic-0 -- bash apache-karaf-4.2.14/bin/client
    Welcome to Open Network Operating System (ONOS)!
         ____  _  ______  ____
        / __ \/ |/ / __ \/ __/
       / /_/ /    / /_/ /\ \
       \____/_/|_/\____/___/

    Documentation: wiki.onosproject.org
    Tutorials:     tutorials.onosproject.org
    Mailing lists: lists.onosproject.org

    Come help out! Find out how at: contribute.onosproject.org

    Hit '<tab>' for a list of available commands
    and '[cmd] --help' for help on a specific command.
    Hit '<ctrl-d>' or type 'logout' to exit ONOS session.

    karaf@root >

You can attach to the ONOS logs by using the ``log:tail`` command:

.. code-block::

    karaf@root > log:tail
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine1 -> device:leaf1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine2 -> device:leaf1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf1 -> device:spine1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf2 -> device:spine1

The command continuously displays the log entries, which is useful for a live debugging session.
Complete ONOS logs can be accessed by using the ``kubectl logs`` command as explained in the previous section.
If the logs do not reveal the cause, you can inspect
the ONOS state by issuing specific CLI commands. The section `Frequently Used Commands`_ lists a few commands we
frequently use when troubleshooting SD-Fabric.

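For scripted checks, the port-forward and SSH steps can be combined into a single call. The sketch below is
an illustration only: the function name ``onos_cmd`` is hypothetical, the two-second wait is an arbitrary guess at how
long the tunnel needs to come up, and it assumes that the ONOS SSH server runs a command passed on the ``ssh``
command line. ``ssh`` will still prompt for the password unless key-based access has been configured.

```shell
# Hypothetical helper: run one ONOS CLI command through a temporary port-forward.
onos_cmd() {
  # $1 = namespace, $2 = pod name, remaining arguments = ONOS CLI command
  ns="$1"; pod="$2"; shift 2
  kubectl -n "$ns" port-forward "$pod" 8101 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2                        # crude wait for the tunnel to come up
  ssh -p 8101 karaf@localhost "$@"
  rc=$?
  kill "$pf_pid" 2>/dev/null     # tear the port-forward down again
  return "$rc"
}
```

For example, ``onos_cmd sdfabric onos-tost-onos-classic-0 devices -s`` would print the simplified device list and
then close the tunnel.
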
Pipeline Walk-through
^^^^^^^^^^^^^^^^^^^^^

.. note::
    More information on the pipeline walk-through is coming soon.

onos-diagnostics
^^^^^^^^^^^^^^^^

If you can't figure out what is going wrong, you can seek help on the SD-Fabric developer mailing list
``sdfabric-dev@opennetworking.org`` or reach out on the ``sdfabric-dev`` Slack channel. There are a few
things we would like you to attach:

- **Issue description**

- **Environment description**, such as SD-Fabric version, switch model and SDE version

- **Steps to reproduce**, as detailed as possible

- **Diagnostics**.

We have built a tool `onos-diagnostics <https://wiki.onosproject.org/display/ONOS/ONOS+Remote+Admin+Tools>`_
to help you easily collect and package ONOS diagnostics. The tool collects various information from the running
ONOS cluster and packages it into one, easy-to-share archive file. This tool is distributed as part of the ONOS
software itself (under the bin directory), but is also available as part of a small archive of remote tools to administer
an ONOS cluster (`onos-admin-\*.tar.gz`).

Alternatively, it is possible to use ``onos-diagnostics-k8s`` in Kubernetes-enabled environments. The tool produces
the same results as onos-diagnostics and relies only on ``kubectl`` commands. The tool needs to know the name of
the namespace, which can be provided through the ``-s`` option. You then have to provide the names of the target
pods. To avoid having to specify these names as part of the command, you can export the ``ONOS_PODS`` environment
variable. Here's an example of how to set the variable:

.. code-block::

    $ export ONOS_PODS="onos-0 onos-1 onos-2"

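Instead of typing the pod names by hand, they can be derived from the cluster. The sketch below is hypothetical:
the function name ``onos_pod_names`` is made up, and the name pattern is an assumption based on the pod names shown
earlier in this guide (``onos-tost-onos-classic-0`` and so on), so adjust it to your release name.

```shell
# Hypothetical helper: print the ONOS pod names of a namespace on one line.
onos_pod_names() {
  # $1 = namespace; the pattern below assumes pods named <release>-onos-classic-<index>
  kubectl -n "$1" get pods --no-headers |
    awk '$1 ~ /onos-classic-[0-9]+$/ { printf "%s%s", sep, $1; sep = " " } END { print "" }'
}
```

The result can be exported directly with ``export ONOS_PODS="$(onos_pod_names sdfabric)"``.
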
The tool needs to know the Karaf home (path from the mount point). To avoid having to specify this path as part
of the command, you can export the ``KARAF_HOME`` environment variable:

.. code-block::

    $ export KARAF_HOME="apache-karaf-4.2.14"

Once done, the ``onos-diagnostics-k8s`` tool can be run as follows:

.. code-block::

    $ onos-diagnostics-k8s -s sdfabric

The ``-n`` option lets you name the resulting archive file, which helps to tell different
cluster instances apart, e.g.

.. code-block::

    # This will produce archive file /tmp/delta-pod-diags.tar.gz
    $ onos-diagnostics-k8s -s sdfabric -n delta-pod

By default, ``onos-diagnostics-k8s`` uses ``ONOS_PROFILE`` to collect the diagnostics. You can tailor the behavior of the
command to your needs by specifying a different `profile <https://github.com/opennetworkinglab/onos/blob/master/tools/package/runtime/bin/onos-diagnostics-profile>`_.
For SD-Fabric we suggest using ``TRELLIS_PROFILE``. The resulting `/tmp/\*-diags.tar.gz` file will contain all
relevant information about the ONOS cluster.

The following is an example of a complete ``onos-diagnostics-k8s`` command:

.. code-block::

    $ DIAGS_PROFILE=TRELLIS_PROFILE onos-diagnostics-k8s -k apache-karaf-4.2.14 -s sdfabric onos-tost-onos-classic-0 onos-tost-onos-classic-1 onos-tost-onos-classic-2

UP4 Troubleshooting
-------------------

.. note::
    More information on UP4 troubleshooting is coming soon.

Common Issues
-------------

.. note::
    Here is a list of common issues.
    More details on each case are coming soon.

ImagePullBackOff
^^^^^^^^^^^^^^^^

ONOS pod not ready (1)
^^^^^^^^^^^^^^^^^^^^^^

ONOS pod not ready (2)
^^^^^^^^^^^^^^^^^^^^^^

ONOS pods not configured
^^^^^^^^^^^^^^^^^^^^^^^^

Packet-In not working
^^^^^^^^^^^^^^^^^^^^^

Device offline
^^^^^^^^^^^^^^

Frequently Used Commands
------------------------

In this subsection, we introduce a few commands we frequently use when troubleshooting SD-Fabric.

ONOS
^^^^
To execute the following ONOS CLI commands:

- Create K8s port forwarding with `kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101`
- Log in to the ONOS CLI with `ssh -p 8101 karaf@localhost`. The default password is `karaf`.

ONOS basics
"""""""""""
- ``flows``: List flow tables. `-s` for simplified output.
- ``groups``: List group tables. `-s` for simplified output.
- ``devices``: List device information. `-s` for simplified output.
- ``ports``: List port information. `-e` to list enabled ports only.
- ``links``: List discovered links
- ``hosts``: List discovered hosts. `-s` for simplified output.
- ``netcfg``: List network configuration
- ``interfaces``: List interface configuration

trellis-control
"""""""""""""""
- ``sr-pr-list``: List current recovery phase of each device
- ``sr-device-subnets``: List device-subnet mapping

fabric-tna
""""""""""
- ``slices``: List network slices
- ``tcs``: List traffic classes of given slice

up4
"""
- ``read-entities -a``: Print UPF entities installed in the UPF dataplane.
  More options are available. See ``read-entities --help``


Stratum
^^^^^^^
To execute the following BF Shell commands:

- Log in to the Stratum switch via `ssh`.
- Attach to the Stratum docker container with `docker attach \`docker ps | grep stratum-bfrt | awk \'{print $1}\'\``

  - Hit `enter` for the prompt
  - Use `<Ctrl-P><Ctrl-Q>` to detach from the container. Do not use `<Ctrl-C>`, since it will terminate the process.

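The container lookup used in the attach step can also be wrapped in a small function, which avoids the nested
backticks. This is a sketch only: the function name ``stratum_container_id`` is hypothetical, and it picks the first
``docker ps`` line that mentions ``stratum-bfrt``.

```shell
# Hypothetical helper: print the ID of the first running container whose
# `docker ps` line mentions stratum-bfrt.
stratum_container_id() {
  docker ps | awk '/stratum-bfrt/ { print $1; exit }'
}
```

With this in place, the attach step becomes ``docker attach "$(stratum_container_id)"``.
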
BF Shell
""""""""
- ``pm.show``: List port configurations. `-a` to list all ports.