.. _troubleshooting_guide:

Troubleshooting Guide
=====================

This section provides hints and useful commands to help you troubleshoot traffic-related problems
or Kubernetes (k8s) related issues. These two types of issues are closely related, because both
control plane software and data plane software are containerized and deployed as Kubernetes services in SD-Fabric.
Please refer to :ref:`architecture_design` for further details.

K8s troubleshooting
-------------------

We assume that ``kubectl`` has already been installed on your local machine.
The first step is to set up the proper ``kubeconfig`` file to access the k8s cluster you want to troubleshoot:

.. code-block::

    $ export KUBECONFIG=~/kubeconfig/dev-sdfabric-menlo
    $ kubectl config use-context dev-sdfabric-menlo
    Switched to context "dev-sdfabric-menlo".

You can get the list of the k8s namespaces using the ``kubectl get`` command:

.. code-block::

    $ kubectl get namespaces
    ...
    kube-node-lease   Active   68d
    kube-public       Active   68d
    kube-system       Active   68d
    security-scan     Active   68d
    sdfabric          Active   26h

Let's assume that SD-Fabric resources are deployed under the namespace ``sdfabric``. Make sure that the ``sdfabric``
namespace has been properly created (other namespaces may be created as well; please check your overarching chart).

If the deployment is not successful,
a first check is to make sure there are enough available nodes in the target cluster.
You can check the available nodes with the ``kubectl get nodes`` command:

.. code-block::

    $ kubectl get nodes -o wide
    NAME       STATUS   ROLES                      AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION             CONTAINER-RUNTIME
    compute1   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.74   <none>        Ubuntu 18.04.6 LTS             5.4.0-73-generic           docker://20.10.9
    compute2   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.72   <none>        Ubuntu 18.04.5 LTS             5.4.0-73-generic           docker://19.3.15
    compute3   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.68   <none>        Ubuntu 18.04.5 LTS             5.4.0-73-generic           docker://19.3.15
    leaf1      Ready    worker                     39d   v1.18.8   10.76.28.70   <none>        Debian GNU/Linux 9 (stretch)   4.14.49-OpenNetworkLinux   docker://19.3.15
    leaf2      Ready    worker                     39d   v1.18.8   10.76.28.71   <none>        Debian GNU/Linux 9 (stretch)   4.14.49-OpenNetworkLinux   docker://19.3.15

You should have at least `3+N` available nodes, where N depends on the deployed network topology. Please note that ONOS
cannot be scheduled on the network devices (these are special worker nodes), and different ONOS instances cannot share
the same worker node (the same applies to Atomix).
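
As a rough aid for the `3+N` check, the sketch below counts the nodes reported as ``Ready``. The helper names
``count_ready_nodes`` and ``enough_nodes`` are hypothetical (they are not part of SD-Fabric or ``kubectl``), and the
code assumes the default column layout of ``kubectl get nodes``, where ``STATUS`` is the second column.

.. code-block:: shell

    # Print the number of nodes whose STATUS column is exactly "Ready".
    # Reads the output of `kubectl get nodes` from standard input.
    count_ready_nodes() {
        awk 'NR > 1 && $2 == "Ready" { n++ } END { print n + 0 }'
    }

    # Succeed only if at least 3 + N nodes are Ready; N is the first argument.
    enough_nodes() {
        ready=$(count_ready_nodes)
        [ "$ready" -ge $((3 + $1)) ]
    }

For example, ``kubectl get nodes | enough_nodes 2 && echo "enough nodes"`` prints the message only when at least five
nodes are ``Ready``. Nodes with a compound status such as ``Ready,SchedulingDisabled`` are not counted, since pods
cannot be scheduled on them.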

Every deployment should include at least a basic set of pods.
You can get the list of the pods by using ``kubectl get pods -n sdfabric``:

.. code-block::

    $ kubectl get pods -n sdfabric -o wide
    NAME                                                        READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES
    onos-tost-atomix-0                                          1/1     Running   0          6h31m   10.72.106.161   compute3   <none>           <none>
    onos-tost-atomix-1                                          1/1     Running   0          6h31m   10.72.111.229   compute1   <none>           <none>
    onos-tost-atomix-2                                          1/1     Running   0          6h31m   10.72.75.254    compute2   <none>           <none>
    onos-tost-onos-classic-0                                    1/1     Running   0          98m     10.72.106.133   compute3   <none>           <none>
    onos-tost-onos-classic-1                                    1/1     Running   0          6h31m   10.72.111.207   compute1   <none>           <none>
    onos-tost-onos-classic-2                                    1/1     Running   0          6h31m   10.72.75.247    compute2   <none>           <none>
    onos-tost-onos-classic-onos-config-loader-ddc9d68bb-lq97t   1/1     Running   0          6h19m   10.72.106.190   compute3   <none>           <none>
    stratum-bwlvh                                               1/1     Running   0          6h31m   10.76.28.70     leaf1      <none>           <none>
    stratum-gh842                                               1/1     Running   0          6h31m   10.76.28.71     leaf2      <none>           <none>

3 Atomix nodes and 3 ONOS nodes are needed for HA. `onos-config-loader` is equally important, because without it ONOS
cannot be properly configured. The number of Stratum pods depends on the deployed topology. If the status of the pods
is not `Running`, you can check the events published by k8s components to get a first idea of what is happening:

.. code-block::

    $ kubectl get events -n sdfabric --sort-by='.lastTimestamp'
    LAST SEEN   TYPE     REASON              OBJECT                           MESSAGE
    12m         Normal   Scheduled           pod/telegraf-75b959574d-sl8qb    Successfully assigned tost/telegraf-75b959574d-sl8qb to compute3
    12m         Normal   SuccessfulCreate    replicaset/telegraf-75b959574d   Created pod: telegraf-75b959574d-sl8qb
    12m         Normal   ScalingReplicaSet   deployment/telegraf              Scaled up replica set telegraf-75b959574d to 1
    12m         Normal   Pulled              pod/telegraf-75b959574d-sl8qb    Container image "telegraf:1.17" already present on machine
    12m         Normal   AddedInterface      pod/telegraf-75b959574d-sl8qb    Add eth0 [10.72.106.153/32]
    12m         Normal   Started             pod/telegraf-75b959574d-sl8qb    Started container telegraf
    12m         Normal   Created             pod/telegraf-75b959574d-sl8qb    Created container telegraf
    ...

The option ``--sort-by='.lastTimestamp'`` sorts the events by time. The previous command
reports all the events that happened in the ``sdfabric`` namespace. If you want more insight into a specific
pod, you can use the command ``kubectl describe pods``:

.. code-block::

    $ kubectl describe pods -n sdfabric onos-tost-onos-classic-0
    Name:         onos-tost-onos-classic-0
    Namespace:    sdfabric
    Priority:     0
    Node:         compute3/10.76.28.68
    Start Time:   Mon, 11 Oct 2021 10:35:43 +0200
    ...
    Events:
      Type    Reason    Age    From    Message
      ...
      {"message":"pending"}
      org.onosproject.segmentrouting is not yet ready

The ``Events`` section typically provides useful information about the issues the pod is facing.
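
When the event list is long, it can help to keep only the ``Warning`` entries. The following is a minimal sketch:
``warning_events`` is a hypothetical helper name, and the filter assumes the default table layout of
``kubectl get events`` shown earlier, where the type is the second field of each data row.

.. code-block:: shell

    # Keep the header line and every event row whose TYPE is "Warning".
    # Reads the output of `kubectl get events` from standard input.
    warning_events() {
        awk 'NR == 1 || $2 == "Warning"'
    }

A possible invocation is ``kubectl get events -n sdfabric --sort-by='.lastTimestamp' | warning_events``.
``kubectl`` can also filter on the server side with ``--field-selector type=Warning``.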

Both ONOS and Atomix define readiness probes, which make sure that the pods are ready before any configuration
takes place. As a consequence, if the probes fail for a given pod, the output of
``kubectl get pods`` shows ``0/1`` in the ``READY`` column next to its name. `ONOS pod not ready (1)`_ and
`ONOS pod not ready (2)`_ describe two scenarios frequently faced by the SD-Fabric developers.
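
To spot such pods quickly, the ``READY`` and ``STATUS`` columns can be checked with a small filter. This is only a
sketch: ``not_ready_pods`` is a hypothetical helper name, and it assumes the default column order of
``kubectl get pods`` (name, ready count, status).

.. code-block:: shell

    # Print the name of every pod that is not fully ready or not Running.
    # Reads the output of `kubectl get pods` from standard input.
    not_ready_pods() {
        awk 'NR > 1 {
            split($2, ready, "/")    # "0/1" -> ready[1] = 0, ready[2] = 1
            if (ready[1] != ready[2] || $3 != "Running") print $1
        }'
    }

For example, ``kubectl get pods -n sdfabric | not_ready_pods`` prints nothing when all pods are healthy. Note that
pods of finished jobs (status ``Completed``) are reported as well, because their status is not ``Running``.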

Logs of the SD-Fabric pods can be accessed by using the ``kubectl logs`` command:

.. code-block::

    $ kubectl -n sdfabric logs onos-tost-onos-classic-0
    2021-10-12 04:46:17,955 INFO [EventAdminConfigurationNotifier] Sending Event Admin notification (configuration successful) to org/ops4j/pax/logging/Configuration
    ...
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Changes to perform:
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Region: root
    2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Bundles to install:


ONOS Troubleshooting
--------------------

You can get the ONOS CLI by establishing an SSH connection to port ``8101`` (the default password is `karaf`):

.. code-block::

    $ kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101
    # In another terminal, or after sending the port-forward to the background with its output redirected to /dev/null
    $ ssh -p 8101 karaf@localhost
    The authenticity of host '[localhost]:8101 ([127.0.0.1]:8101)' can't be established.
    RSA key fingerprint is SHA256:Mlaax9tHmIR6WwK0B3okC1O4mpAuoXjI7Z5+KKelxOo.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added '[localhost]:8101' (RSA) to the list of known hosts.
    Password authentication
    Password:
    Welcome to Open Network Operating System (ONOS)!
         ____  _  ______  ____
        / __ \/ |/ / __ \/ __/
       / /_/ /    / /_/ /\ \
       \____/_/|_/\____/___/

    Documentation: wiki.onosproject.org
    Tutorials:     tutorials.onosproject.org
    Mailing lists: lists.onosproject.org

    Come help out! Find out how at: contribute.onosproject.org

    Hit '<tab>' for a list of available commands
    and '[cmd] --help' for help on a specific command.
    Hit '<ctrl-d>' or type 'logout' to exit ONOS session.

    karaf@root >
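
The port-forward and SSH steps can also be combined in a single terminal. The sketch below is one way to do it,
under a few assumptions: ``onos_cli`` is a hypothetical helper name, the default pod and namespace are the ones used
in this guide, and the two-second pause is a crude guess at how long the port-forward needs to come up.

.. code-block:: shell

    # Forward port 8101 of an ONOS pod in the background, open the ONOS CLI
    # over SSH, and stop the port-forward when the SSH session ends.
    # The body runs in a subshell, so its variables and trap stay local.
    onos_cli() (
        pod="${1:-onos-tost-onos-classic-0}"
        namespace="${2:-sdfabric}"
        kubectl -n "$namespace" port-forward "$pod" 8101 >/dev/null 2>&1 &
        forward_pid=$!
        # Stop the port-forward whenever this subshell exits.
        trap 'kill "$forward_pid" 2>/dev/null' EXIT
        sleep 2   # crude wait for the port-forward; adjust as needed
        ssh -p 8101 karaf@localhost
    )

Calling ``onos_cli`` with no arguments connects to ``onos-tost-onos-classic-0`` in the ``sdfabric`` namespace;
``onos_cli onos-tost-onos-classic-1`` targets another instance.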

Alternatively, if it is not possible to establish an SSH connection with the ONOS pods,
you can use the ``kubectl exec`` command on the target pod:

.. code-block::

    $ kubectl -n sdfabric exec -it onos-tost-onos-classic-0 -- bash apache-karaf-4.2.9/bin/client
    Welcome to Open Network Operating System (ONOS)!
         ____  _  ______  ____
        / __ \/ |/ / __ \/ __/
       / /_/ /    / /_/ /\ \
       \____/_/|_/\____/___/

    Documentation: wiki.onosproject.org
    Tutorials:     tutorials.onosproject.org
    Mailing lists: lists.onosproject.org

    Come help out! Find out how at: contribute.onosproject.org

    Hit '<tab>' for a list of available commands
    and '[cmd] --help' for help on a specific command.
    Hit '<ctrl-d>' or type 'logout' to exit ONOS session.

    karaf@root >

You can attach to the ONOS logs by using the ``log:tail`` command:

.. code-block::

    karaf@root > log:tail
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine1 -> device:leaf1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine2 -> device:leaf1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf1 -> device:spine1
    20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf2 -> device:spine1

The command continuously displays the log entries, which is useful for a live debugging session.
Complete ONOS logs can be accessed by using the ``kubectl logs`` command as explained in the previous section.
If nothing can be figured out from the logs, you can inspect
the ONOS state by issuing specific CLI commands. The section `Frequently Used Commands`_ lists a few commands we frequently use
when troubleshooting SD-Fabric.

Pipeline Walk-through
^^^^^^^^^^^^^^^^^^^^^
.. note::
    More information on the pipeline walk-through is coming soon

onos-diagnostics
^^^^^^^^^^^^^^^^

If you can't figure out what is going wrong, you can seek help on the SD-Fabric developer mailing list
``sdfabric-dev@opennetworking.org`` or you can reach out on the ``sdfabric-dev`` Slack channel. There are a few
things we would like you to attach:

- **Issue description**

- **Environment description**, such as SD-Fabric version, switch model and SDE version

- **Steps to reproduce**, as detailed as possible

- **Diagnostics**.

We have built a tool, `onos-diagnostics <https://wiki.onosproject.org/display/ONOS/ONOS+Remote+Admin+Tools>`_,
to help you easily collect and package ONOS diagnostics. The tool collects various information from the running
ONOS cluster and packages it into one, easy-to-share archive file. This tool is distributed as part of the ONOS
software itself (under the ``bin`` directory), but is also available as part of a small archive of remote tools to administer
an ONOS cluster (`onos-admin-\*.tar.gz`).

Alternatively, you can use ``onos-diagnostics-k8s`` in Kubernetes-enabled environments. The tool produces
the same results as ``onos-diagnostics`` and relies only on ``kubectl`` commands. The tool needs to know the name of
the namespace, which can be provided through the option ``-s``. You also have to provide the names of the target
pods. To avoid having to specify these names as part of the command, you can export the ``ONOS_PODS`` environment
variable. Here’s an example of how to set the variable:

.. code-block::

    $ export ONOS_PODS="onos-0 onos-1 onos-2"

The tool needs to know the Karaf home (path from the mount point). To avoid having to specify this path as part
of the command, you can export the ``KARAF_HOME`` environment variable:

.. code-block::

    $ export KARAF_HOME="apache-karaf-4.2.9"

Once done, the ``onos-diagnostics-k8s`` tool can be run as follows:

.. code-block::

    $ onos-diagnostics-k8s -s sdfabric

The option ``-n`` lets you name the resulting archive file, which helps to differentiate between
cluster instances, for example:

.. code-block::

    # This will produce archive file /tmp/delta-pod-diags.tar.gz
    $ onos-diagnostics-k8s -s sdfabric -n delta-pod

By default ``onos-diagnostics-k8s`` uses ``ONOS_PROFILE`` to collect the diagnostics. You can tailor the behavior of the
command to your needs by specifying a different `profile <https://github.com/opennetworkinglab/onos/blob/master/tools/package/runtime/bin/onos-diagnostics-profile>`_.
For SD-Fabric we suggest using ``TRELLIS_PROFILE``. The resulting `/tmp/\*-diags.tar.gz` file will contain all
relevant information about the ONOS cluster.
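
Rather than typing the pod names by hand, the value of ``ONOS_PODS`` can be derived from the pod list. The sketch
below assumes that the ONOS pods are named ``<release>-onos-classic-<index>``, as in the listing shown earlier in
this guide; ``onos_pod_names`` is a hypothetical helper name, not part of the ONOS tools.

.. code-block:: shell

    # Print the ONOS pod names on one line, separated by spaces.
    # Reads `kubectl get pods` output (table or `-o name` format) from
    # standard input and keeps only names ending in "onos-classic-<index>".
    onos_pod_names() {
        awk '{ sub(/^pod\//, "", $1) }
             $1 ~ /onos-classic-[0-9]+$/ { printf "%s%s", sep, $1; sep = " " }
             END { print "" }'
    }

A possible use is ``export ONOS_PODS="$(kubectl -n sdfabric get pods | onos_pod_names)"``. The pattern deliberately
skips the ``onos-config-loader`` pod, whose name does not end in a numeric index.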

The following is an example of a complete ``onos-diagnostics-k8s`` command:

.. code-block::

    $ DIAGS_PROFILE=TRELLIS_PROFILE onos-diagnostics-k8s -k apache-karaf-4.2.9 -s sdfabric onos-tost-onos-classic-0 onos-tost-onos-classic-1 onos-tost-onos-classic-2

UP4 Troubleshooting
-------------------

.. note::
    More information on UP4 troubleshooting is coming soon

Common Issues
-------------

.. note::
    Here is a list of common issues.
    More details on each case are coming soon

ImagePullBackOff
^^^^^^^^^^^^^^^^

ONOS pod not ready (1)
^^^^^^^^^^^^^^^^^^^^^^

ONOS pod not ready (2)
^^^^^^^^^^^^^^^^^^^^^^

ONOS pods not configured
^^^^^^^^^^^^^^^^^^^^^^^^

Packet-In not working
^^^^^^^^^^^^^^^^^^^^^

Device offline
^^^^^^^^^^^^^^

Frequently Used Commands
------------------------

This subsection introduces a few commands we frequently use when troubleshooting SD-Fabric.

ONOS
^^^^
To execute the following ONOS CLI commands:

- Create K8s port forwarding with `kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101`
- Log in to the ONOS CLI with `ssh -p 8101 karaf@localhost`. The default password is `karaf`

ONOS basics
"""""""""""
- `flows`: List flow tables. `-s` for simplified output.
- `groups`: List group tables. `-s` for simplified output.
- `devices`: List device information. `-s` for simplified output.
- `ports`: List port information. `-e` to list enabled ports only.
- `links`: List discovered links
- `hosts`: List discovered hosts. `-s` for simplified output.
- `netcfg`: List network configuration
- `interfaces`: List interface configuration

trellis-control
"""""""""""""""
- `sr-pr-list`: List current recovery phase of each device
- `sr-device-subnets`: List device-subnet mapping

fabric-tna
""""""""""
- `slices`: List network slices
- `tcs`: List traffic classes of given slice

up4
"""
- `read-interfaces`: List all interfaces installed in the data plane
- `read-pdrs`: List all PDRs installed in the data plane
- `read-fars`: List all FARs installed in the data plane
- `read-flows`: List all UE data flows installed in the data plane

Stratum
^^^^^^^
To execute the following BF Shell commands:

- Log in to the Stratum switch with `ssh root@<switch_ip>`. The default password is `onl`
- Attach to the Stratum docker container with `docker attach \`docker ps | grep stratum-bfrt | awk \'{print $1}\'\``

  - Hit `enter` for the prompt
  - Use `<Ctrl-P><Ctrl-Q>` to exit the container. Do not use `<Ctrl-C>` since it will terminate the process.

BF Shell
""""""""
- `pm.show`: List port configurations. `-a` to list all ports.
Charles Chanbf55e742021-10-04 17:46:46 -0700354- `pm.show`: List port configurations. `-a` to list all ports.