First version of troubleshooting guide
Change-Id: I1cfc5e93fb0c17601a280754b2841db680552d58
diff --git a/architecture.rst b/architecture.rst
index 4fa4cb1..16b83a1 100644
--- a/architecture.rst
+++ b/architecture.rst
@@ -1,3 +1,5 @@
+.. _architecture_design:
+
Architecture and Design
=======================
diff --git a/dict.txt b/dict.txt
index c839c43..3598def 100644
--- a/dict.txt
+++ b/dict.txt
@@ -3,6 +3,7 @@
ACL
Aether
Analytics
+Atomix
Broadcom
Clos
DDoS
@@ -91,6 +92,7 @@
misconfiguration
multicast
namespace
+namespaces
natively
netcfg
nodeSelector
diff --git a/troubleshooting.rst b/troubleshooting.rst
index 6a64d34..149d051 100644
--- a/troubleshooting.rst
+++ b/troubleshooting.rst
@@ -3,9 +3,256 @@
Troubleshooting Guide
=====================
+This section provides hints and useful commands to help you troubleshoot traffic-related problems
+or k8s-related issues. Keep in mind that these two types of issues are closely related, as both
+control plane software and data plane software are containerized and deployed as Kubernetes services in SD-Fabric.
+Please refer to :ref:`architecture_design` for further details.
+
+K8s troubleshooting
+-------------------
+
+We assume that ``kubectl`` has already been installed on your local machine.
+The first step is to set up the proper ``kubeconfig`` file to access the k8s cluster you want to troubleshoot:
+
+.. code-block::
+
+ $ export KUBECONFIG=~/kubeconfig/dev-sdfabric-menlo
+ $ kubectl config use-context dev-sdfabric-menlo
+ Switched to context "dev-sdfabric-menlo".
+
+You can get the list of the k8s namespaces using the ``kubectl get`` command:
+
+.. code-block::
+
+ $ kubectl get namespaces
+ ...
+ kube-node-lease Active 68d
+ kube-public Active 68d
+ kube-system Active 68d
+ security-scan Active 68d
+ sdfabric Active 26h
+
+We assume that SD-Fabric resources are deployed under the namespace ``sdfabric``, so make sure that the ``sdfabric``
+namespace has been properly created (other namespaces may also be created; please check your overarching chart).
+
+If the deployment is not successful,
+first make sure there are enough available nodes in the target cluster.
+You can check the available nodes with the ``kubectl get nodes`` command:
+
+.. code-block::
+
+ $ kubectl get nodes -o wide
+ NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
+ compute1 Ready controlplane,etcd,worker 39d v1.18.8 10.76.28.74 <none> Ubuntu 18.04.6 LTS 5.4.0-73-generic docker://20.10.9
+ compute2 Ready controlplane,etcd,worker 39d v1.18.8 10.76.28.72 <none> Ubuntu 18.04.5 LTS 5.4.0-73-generic docker://19.3.15
+ compute3 Ready controlplane,etcd,worker 39d v1.18.8 10.76.28.68 <none> Ubuntu 18.04.5 LTS 5.4.0-73-generic docker://19.3.15
+ leaf1 Ready worker 39d v1.18.8 10.76.28.70 <none> Debian GNU/Linux 9 (stretch) 4.14.49-OpenNetworkLinux docker://19.3.15
+ leaf2 Ready worker 39d v1.18.8 10.76.28.71 <none> Debian GNU/Linux 9 (stretch) 4.14.49-OpenNetworkLinux docker://19.3.15
+
+You should have at least `3+N` available nodes, where N depends on the deployed network topology. Please note that ONOS
+cannot be scheduled on the network devices (these are special worker nodes), and different ONOS instances cannot share
+the same worker node (the same applies to Atomix).
+
+At a minimum, you should see the basic pods that are present in every deployment.
+You can get the list of the pods by using ``kubectl get pods -n sdfabric``:
+
+.. code-block::
+
+ $ kubectl get pods -n sdfabric -o wide
+ NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
+ onos-tost-atomix-0 1/1 Running 0 6h31m 10.72.106.161 compute3 <none> <none>
+ onos-tost-atomix-1 1/1 Running 0 6h31m 10.72.111.229 compute1 <none> <none>
+ onos-tost-atomix-2 1/1 Running 0 6h31m 10.72.75.254 compute2 <none> <none>
+ onos-tost-onos-classic-0 1/1 Running 0 98m 10.72.106.133 compute3 <none> <none>
+ onos-tost-onos-classic-1 1/1 Running 0 6h31m 10.72.111.207 compute1 <none> <none>
+ onos-tost-onos-classic-2 1/1 Running 0 6h31m 10.72.75.247 compute2 <none> <none>
+ onos-tost-onos-classic-onos-config-loader-ddc9d68bb-lq97t 1/1 Running 0 6h19m 10.72.106.190 compute3 <none> <none>
+ stratum-bwlvh 1/1 Running 0 6h31m 10.76.28.70 leaf1 <none> <none>
+ stratum-gh842 1/1 Running 0 6h31m 10.76.28.71 leaf2 <none> <none>
+
+3 Atomix nodes and 3 ONOS nodes are needed for HA. `onos-config-loader` is equally important, because without it ONOS
+cannot be properly configured. The number of Stratum pods depends on the deployed topology. If the status of the pods
+is not `Running`, you can check the events published by k8s components to get a first idea of what is happening:
+
+.. code-block::
+
+ $ kubectl get events -n sdfabric --sort-by='.lastTimestamp'
+ LAST SEEN TYPE REASON OBJECT MESSAGE
+ 12m Normal Scheduled pod/telegraf-75b959574d-sl8qb Successfully assigned tost/telegraf-75b959574d-sl8qb to compute3
+ 12m Normal SuccessfulCreate replicaset/telegraf-75b959574d Created pod: telegraf-75b959574d-sl8qb
+ 12m Normal ScalingReplicaSet deployment/telegraf Scaled up replica set telegraf-75b959574d to 1
+ 12m Normal Pulled pod/telegraf-75b959574d-sl8qb Container image "telegraf:1.17" already present on machine
+ 12m Normal AddedInterface pod/telegraf-75b959574d-sl8qb Add eth0 [10.72.106.153/32]
+ 12m Normal Started pod/telegraf-75b959574d-sl8qb Started container telegraf
+ 12m Normal Created pod/telegraf-75b959574d-sl8qb Created container telegraf
+ ...
+
+The option ``--sort-by='.lastTimestamp'`` is typically used to get the events sorted by time. The previous command
+reports all the events that happened in the ``sdfabric`` namespace. If you want more insight into a specific
+pod, you can use the command ``kubectl describe pods``:
+
+.. code-block::
+
+ $ kubectl describe pods -n sdfabric onos-tost-onos-classic-0
+ Name: onos-tost-onos-classic-0
+ Namespace: sdfabric
+ Priority: 0
+ Node: compute3/10.76.28.68
+ Start Time: Mon, 11 Oct 2021 10:35:43 +0200
+ ...
+ Events:
+ Type Reason Age From Message
+ ...
+ {"message":"pending"}
+ org.onosproject.segmentrouting is not yet ready
+
+The ``Events`` section typically provides useful information about the issues the pod is facing.
+
+Both ONOS and Atomix define readiness probes, which make sure that the pods are ready before any configuration
+takes place. As a consequence, if the probes fail for a given pod, the output of the command
+``kubectl get pods`` shows ``0/1`` next to its name under the column ``READY``. We report in `ONOS pod not ready (1)`_ and
+`ONOS pod not ready (2)`_ two scenarios frequently faced by the SD-Fabric developers.
+
+Logs of the SD-Fabric pods can be accessed by using the ``kubectl logs`` command:
+
+.. code-block::
+
+ $ kubectl -n sdfabric logs onos-tost-onos-classic-0
+ 2021-10-12 04:46:17,955 INFO [EventAdminConfigurationNotifier] Sending Event Admin notification (configuration successful) to org/ops4j/pax/logging/Configuration
+ ...
+ 2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Changes to perform:
+ 2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Region: root
+ 2021-10-12 04:46:18,991 INFO [FeaturesServiceImpl] Bundles to install:
+
+
+ONOS Troubleshooting
+--------------------
+
+You can get the ONOS CLI by establishing an SSH connection to port ``8101`` (the default password is `karaf`):
+
+.. code-block::
+
+ $ kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101
+ # In another terminal (alternatively, redirect the port-forward output to /dev/null)
+ $ ssh -p 8101 karaf@localhost
+ The authenticity of host '[localhost]:8101 ([127.0.0.1]:8101)' can't be established.
+ RSA key fingerprint is SHA256:Mlaax9tHmIR6WwK0B3okC1O4mpAuoXjI7Z5+KKelxOo.
+ Are you sure you want to continue connecting (yes/no)? yes
+ Warning: Permanently added '[localhost]:8101' (RSA) to the list of known hosts.
+ Password authentication
+ Password:
+ Welcome to Open Network Operating System (ONOS)!
+ ____ _ ______ ____
+ / __ \/ |/ / __ \/ __/
+ / /_/ / / /_/ /\ \
+ \____/_/|_/\____/___/
+
+ Documentation: wiki.onosproject.org
+ Tutorials: tutorials.onosproject.org
+ Mailing lists: lists.onosproject.org
+
+ Come help out! Find out how at: contribute.onosproject.org
+
+ Hit '<tab>' for a list of available commands
+ and '[cmd] --help' for help on a specific command.
+ Hit '<ctrl-d>' or type 'logout' to exit ONOS session.
+
+ karaf@root >
+
+Alternatively, if it is not possible to establish an SSH connection to the ONOS pods,
+you can use the ``kubectl exec`` command on the target pod:
+
+.. code-block::
+
+ $ kubectl -n sdfabric exec -it onos-tost-onos-classic-0 -- bash apache-karaf-4.2.9/bin/client
+ Welcome to Open Network Operating System (ONOS)!
+ ____ _ ______ ____
+ / __ \/ |/ / __ \/ __/
+ / /_/ / / /_/ /\ \
+ \____/_/|_/\____/___/
+
+ Documentation: wiki.onosproject.org
+ Tutorials: tutorials.onosproject.org
+ Mailing lists: lists.onosproject.org
+
+ Come help out! Find out how at: contribute.onosproject.org
+
+ Hit '<tab>' for a list of available commands
+ and '[cmd] --help' for help on a specific command.
+ Hit '<ctrl-d>' or type 'logout' to exit ONOS session.
+
+ karaf@root
+
+You can attach to the ONOS logs by using the ``log:tail`` command:
+
+.. code-block::
+
+ karaf@root > log:tail
+ 20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine1 -> device:leaf1
+ 20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine2 -> device:leaf1
+ 20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf1 -> device:spine1
+ 20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf2 -> device:spine1
+
+The command continuously displays the log entries, which is useful for a live debugging session.
+Complete ONOS logs can be accessed by using the ``kubectl logs`` command as explained in the previous section.
+If nothing can be figured out from the logs, you can inspect
+the ONOS state by issuing specific CLI commands. The section `Frequently Used Commands`_ lists a few commands we frequently use
+when troubleshooting SD-Fabric.
+
+onos-diagnostics
+^^^^^^^^^^^^^^^^
+
+If you can't figure out what is going wrong, you can seek help on the SD-Fabric developer mailing list
+``sdfabric-dev@opennetworking.org`` or reach out on the ``sdfabric-dev`` Slack channel. There are a few
+things we would like you to attach:
+
+- **Issue description**
+
+- **Environment description**, such as SD-Fabric version, switch model and SDE
+  version
+
+- **Steps to reproduce**, as detailed as possible
+
+- **Diagnostics**.
+
+We have built a tool `onos-diagnostics-k8s <https://wiki.onosproject.org/display/ONOS/ONOS+Remote+Admin+Tools>`_
+to help you easily collect and package ONOS diagnostics.
+
+UP4 Troubleshooting
+-------------------
+
+.. note::
+ More information on UP4 troubleshooting is coming soon.
+
+Common Issues
+-------------
+
+.. note::
+ Here is a list of common issues.
+ More details on each case are coming soon.
+
+ImagePullBackOff
+^^^^^^^^^^^^^^^^
+
+ONOS pod not ready (1)
+^^^^^^^^^^^^^^^^^^^^^^
+
+ONOS pod not ready (2)
+^^^^^^^^^^^^^^^^^^^^^^
+
+ONOS pods not configured
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Packet-In not working
+^^^^^^^^^^^^^^^^^^^^^
+
+Device offline
+^^^^^^^^^^^^^^
+
Frequently Used Commands
------------------------
-In this section, we are going to introduce a few commands we frequently used when troubleshooting SD-Fabric.
+
+In this subsection, we introduce a few commands we frequently use when troubleshooting SD-Fabric.
ONOS
^^^^
@@ -55,18 +302,3 @@
BF Shell
""""""""
- `pm.show`: List port configurations. `-a` to list all ports.
-
-K8s troubleshooting
--------------------
-..
- TODO Hung-Wei
-
-ONOS diagnostics
-----------------
-..
- TODO Hung-Wei
-
-FAQ
----
-..
- TODO Hung-Wei