First version of troubleshooting guide Change-Id: I1cfc5e93fb0c17601a280754b2841db680552d58

commit: 517cd53140150aa762196274f6ee5e60ed4b208d [log] [tgz]
author: pierventre <pier@opennetworking.org> Tue Oct 12 22:58:00 2021 +0200
committer: Charles Chan <charles@opennetworking.org> Tue Oct 12 15:55:49 2021 -0700
tree: 4120a395b7a61d2e744c7d97a3f16cafcda827c8
parent: 43989980af78f20c40343bfc2afec4a8e88e087a [diff]
diff --git a/architecture.rst b/architecture.rst
index 4fa4cb1..16b83a1 100644
--- a/architecture.rst
+++ b/architecture.rst

@@ -1,3 +1,5 @@
+.. _architecture_design:
+
 Architecture and Design
 =======================
 

diff --git a/dict.txt b/dict.txt
index c839c43..3598def 100644
--- a/dict.txt
+++ b/dict.txt

@@ -3,6 +3,7 @@
 ACL
 Aether
 Analytics
+Atomix
 Broadcom
 Clos
 DDoS
@@ -91,6 +92,7 @@
 misconfiguration
 multicast
 namespace
+namespaces
 natively
 netcfg
 nodeSelector

diff --git a/troubleshooting.rst b/troubleshooting.rst
index 6a64d34..149d051 100644
--- a/troubleshooting.rst
+++ b/troubleshooting.rst

@@ -3,9 +3,256 @@
 Troubleshooting Guide
 =====================
 
+This section provides hints and useful commands to help you troubleshoot traffic-related problems
+and k8s-related issues. These two types of issues are closely related, because in SD-Fabric both the
+control plane software and the data plane software are containerized and deployed as Kubernetes services.
+Please refer to :ref:`architecture_design` for further details.
+
+K8s troubleshooting
+-------------------
+
+We assume that ``kubectl`` has already been installed on your local machine.
+The first step is to set up the proper ``kubeconfig`` file to access the k8s cluster you want to troubleshoot:
+
+.. code-block::
+
+    $ export KUBECONFIG=~/kubeconfig/dev-sdfabric-menlo
+    $ kubectl config use-context dev-sdfabric-menlo
+      Switched to context "dev-sdfabric-menlo".
+
+You can get the list of the k8s namespaces using the ``kubectl get namespaces`` command:
+
+.. code-block::
+
+    $ kubectl get namespaces
+      ...
+      kube-node-lease            Active   68d
+      kube-public                Active   68d
+      kube-system                Active   68d
+      security-scan              Active   68d
+      sdfabric                   Active   26h
+
+We assume that SD-Fabric resources are deployed under the ``sdfabric`` namespace, so make sure that this
+namespace has been properly created (other namespaces may be created as well; please check your overarching chart).
+
+If the deployment is not successful,
+first make sure that there are enough available nodes in the target cluster.
+You can check the available nodes with the ``kubectl get nodes`` command:
+
+.. code-block::
+
+    $ kubectl get nodes -o wide
+      NAME       STATUS   ROLES                      AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION             CONTAINER-RUNTIME
+      compute1   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.74   <none>        Ubuntu 18.04.6 LTS             5.4.0-73-generic           docker://20.10.9
+      compute2   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.72   <none>        Ubuntu 18.04.5 LTS             5.4.0-73-generic           docker://19.3.15
+      compute3   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.68   <none>        Ubuntu 18.04.5 LTS             5.4.0-73-generic           docker://19.3.15
+      leaf1      Ready    worker                     39d   v1.18.8   10.76.28.70   <none>        Debian GNU/Linux 9 (stretch)   4.14.49-OpenNetworkLinux   docker://19.3.15
+      leaf2      Ready    worker                     39d   v1.18.8   10.76.28.71   <none>        Debian GNU/Linux 9 (stretch)   4.14.49-OpenNetworkLinux   docker://19.3.15
+
+You should have at least `3+N` available nodes, where N depends on the deployed network topology. Please note that ONOS
+cannot be scheduled on the network devices (these are special worker nodes), and different ONOS instances cannot share
+the same worker node (the same applies to Atomix).
+
+Every deployment should include at least a basic set of pods.
+You can get the list of the pods by using ``kubectl get pods -n sdfabric``:
+
+.. code-block::
+
+    $ kubectl get pods -n sdfabric -o wide
+      NAME                                                        READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES
+      onos-tost-atomix-0                                          1/1     Running   0          6h31m   10.72.106.161   compute3   <none>           <none>
+      onos-tost-atomix-1                                          1/1     Running   0          6h31m   10.72.111.229   compute1   <none>           <none>
+      onos-tost-atomix-2                                          1/1     Running   0          6h31m   10.72.75.254    compute2   <none>           <none>
+      onos-tost-onos-classic-0                                    1/1     Running   0          98m     10.72.106.133   compute3   <none>           <none>
+      onos-tost-onos-classic-1                                    1/1     Running   0          6h31m   10.72.111.207   compute1   <none>           <none>
+      onos-tost-onos-classic-2                                    1/1     Running   0          6h31m   10.72.75.247    compute2   <none>           <none>
+      onos-tost-onos-classic-onos-config-loader-ddc9d68bb-lq97t   1/1     Running   0          6h19m   10.72.106.190   compute3   <none>           <none>
+      stratum-bwlvh                                               1/1     Running   0          6h31m   10.76.28.70     leaf1      <none>           <none>
+      stratum-gh842                                               1/1     Running   0          6h31m   10.76.28.71     leaf2      <none>           <none>
+
+Three Atomix nodes and three ONOS nodes are needed for HA. `onos-config-loader` is equally important, because without it
+ONOS cannot be properly configured. The number of Stratum pods depends on the deployed topology. If the status of a pod
+is not `Running`, you can check the events published by the k8s components to get a first idea of what is happening:
+
+.. code-block::
+
+    $ kubectl get events -n sdfabric --sort-by='.lastTimestamp'
+      LAST SEEN   TYPE      REASON              OBJECT                                                           MESSAGE
+      12m         Normal    Scheduled           pod/telegraf-75b959574d-sl8qb                                    Successfully assigned tost/telegraf-75b959574d-sl8qb to compute3
+      12m         Normal    SuccessfulCreate    replicaset/telegraf-75b959574d                                   Created pod: telegraf-75b959574d-sl8qb
+      12m         Normal    ScalingReplicaSet   deployment/telegraf                                              Scaled up replica set telegraf-75b959574d to 1
+      12m         Normal    Pulled              pod/telegraf-75b959574d-sl8qb                                    Container image "telegraf:1.17" already present on machine
+      12m         Normal    AddedInterface      pod/telegraf-75b959574d-sl8qb                                    Add eth0 [10.72.106.153/32]
+      12m         Normal    Started             pod/telegraf-75b959574d-sl8qb                                    Started container telegraf
+      12m         Normal    Created             pod/telegraf-75b959574d-sl8qb                                    Created container telegraf
+      ...
+
+The option ``--sort-by='.lastTimestamp'`` sorts the events by time. The previous command
+reports all the events that happened in the ``sdfabric`` namespace. If you want more insight into a specific
+pod, you can use the ``kubectl describe pods`` command:
+
+.. code-block::
+
+    $ kubectl describe pods -n sdfabric onos-tost-onos-classic-0
+      Name:         onos-tost-onos-classic-0
+      Namespace:    sdfabric
+      Priority:     0
+      Node:         compute3/10.76.28.68
+      Start Time:   Mon, 11 Oct 2021 10:35:43 +0200
+      ...
+      Events:
+        Type     Reason          Age   From               Message
+        ...
+        {"message":"pending"}
+        org.onosproject.segmentrouting is not yet ready
+
+The ``Events`` section typically provides useful information about the issues the pod is facing.
+
+Both ONOS and Atomix define readiness probes, which make sure that the pods are ready before any configuration
+takes place. As a consequence, if the probes fail for a given pod, the output of
+``kubectl get pods`` shows ``0/1`` in the ``READY`` column next to its name. We report in `ONOS pod not ready (1)`_ and
+`ONOS pod not ready (2)`_ two scenarios frequently faced by the SD-Fabric developers.
+
+Logs of the SD-Fabric pods can be accessed by using the ``kubectl logs`` command:
+
+.. code-block::
+
+    $ kubectl -n sdfabric logs onos-tost-onos-classic-0
+      2021-10-12 04:46:17,955 INFO  [EventAdminConfigurationNotifier] Sending Event Admin notification (configuration successful) to org/ops4j/pax/logging/Configuration
+      ...
+      2021-10-12 04:46:18,991 INFO  [FeaturesServiceImpl] Changes to perform:
+      2021-10-12 04:46:18,991 INFO  [FeaturesServiceImpl]   Region: root
+      2021-10-12 04:46:18,991 INFO  [FeaturesServiceImpl]     Bundles to install:
+
+
+ONOS Troubleshooting
+--------------------
+
+You can get the ONOS CLI by establishing an SSH connection to port ``8101`` (the default password is `karaf`):
+
+.. code-block::
+
+    $ kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101
+    # In another terminal (alternatively, run the port-forward in the background and send its output to /dev/null)
+    $ ssh -p 8101 karaf@localhost
+      The authenticity of host '[localhost]:8101 ([127.0.0.1]:8101)' can't be established.
+      RSA key fingerprint is SHA256:Mlaax9tHmIR6WwK0B3okC1O4mpAuoXjI7Z5+KKelxOo.
+      Are you sure you want to continue connecting (yes/no)? yes
+      Warning: Permanently added '[localhost]:8101' (RSA) to the list of known hosts.
+      Password authentication
+      Password:
+      Welcome to Open Network Operating System (ONOS)!
+           ____  _  ______  ____
+          / __ \/ |/ / __ \/ __/
+         / /_/ /    / /_/ /\ \
+         \____/_/|_/\____/___/
+
+      Documentation: wiki.onosproject.org
+      Tutorials:     tutorials.onosproject.org
+      Mailing lists: lists.onosproject.org
+
+      Come help out! Find out how at: contribute.onosproject.org
+
+      Hit '<tab>' for a list of available commands
+      and '[cmd] --help' for help on a specific command.
+      Hit '<ctrl-d>' or type 'logout' to exit ONOS session.
+
+      karaf@root >
+
+Alternatively, if it is not possible to establish an SSH connection with the ONOS pods,
+you can use the ``kubectl exec`` command on the target pod:
+
+.. code-block::
+
+    $ kubectl -n sdfabric exec -it onos-tost-onos-classic-0 -- bash apache-karaf-4.2.9/bin/client
+      Welcome to Open Network Operating System (ONOS)!
+           ____  _  ______  ____
+          / __ \/ |/ / __ \/ __/
+         / /_/ /    / /_/ /\ \
+         \____/_/|_/\____/___/
+
+      Documentation: wiki.onosproject.org
+      Tutorials:     tutorials.onosproject.org
+      Mailing lists: lists.onosproject.org
+
+      Come help out! Find out how at: contribute.onosproject.org
+
+      Hit '<tab>' for a list of available commands
+      and '[cmd] --help' for help on a specific command.
+      Hit '<ctrl-d>' or type 'logout' to exit ONOS session.
+
+      karaf@root
+
+You can attach to the ONOS logs by using the ``log:tail`` command:
+
+.. code-block::
+
+    $ karaf@root > log:tail
+      20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine1 -> device:leaf1
+      20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine2 -> device:leaf1
+      20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf1 -> device:spine1
+      20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf2 -> device:spine1
+
+The command continuously displays the log entries, which is useful for a live debugging session.
+Complete ONOS logs can be accessed by using the ``kubectl logs`` command, as explained in the previous section.
+If nothing can be figured out from the logs, you can inspect
+the ONOS state by issuing specific CLI commands. The section `Frequently Used Commands`_ lists a few commands we frequently use
+when troubleshooting SD-Fabric.
+
+onos-diagnostics
+^^^^^^^^^^^^^^^^
+
+If you can't figure out what is going wrong, you can seek help on the SD-Fabric developer mailing list
+``sdfabric-dev@opennetworking.org`` or reach out on the ``sdfabric-dev`` Slack channel. There are a few
+things we would like you to attach:
+
+- **Issue description**
+
+- **Environment description**, such as the SD-Fabric version, switch model and
+  SDE version
+
+- **Steps to reproduce**, as detailed as possible
+
+- **Diagnostics**.
+
+We have built a tool `onos-diagnostics-k8s <https://wiki.onosproject.org/display/ONOS/ONOS+Remote+Admin+Tools>`_
+to help you easily collect and package ONOS diagnostics.
+
+UP4 Troubleshooting
+-------------------
+
+.. note::
+  More information on UP4 troubleshooting is coming soon.
+
+Common Issues
+-------------
+
+.. note::
+  Here is a list of common issues.
+  More details on each case are coming soon.
+
+ImagePullBackOff
+^^^^^^^^^^^^^^^^
+
+ONOS pod not ready (1)
+^^^^^^^^^^^^^^^^^^^^^^
+
+ONOS pod not ready (2)
+^^^^^^^^^^^^^^^^^^^^^^
+
+ONOS pods not configured
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Packet-In not working
+^^^^^^^^^^^^^^^^^^^^^
+
+Device offline
+^^^^^^^^^^^^^^
+
 Frequently Used Commands
 ------------------------
-In this section, we are going to introduce a few commands we frequently used when troubleshooting SD-Fabric.
+
+In this subsection, we introduce a few commands we frequently use when troubleshooting SD-Fabric.
 
 ONOS
 ^^^^
@@ -55,18 +302,3 @@
 BF Shell
 """"""""
 - `pm.show`: List port configurations. `-a` to list all ports.
-
-K8s troubleshooting
--------------------
-..
-    TODO Hung-Wei
-
-ONOS diagnostics
-----------------
-..
-    TODO Hung-Wei
-
-FAQ
----
-..
-    TODO Hung-Wei