Reorg, updates and troubleshooting guide
- Expanded hw install
- Change fabric switch bootstrap to DHCP/HTTP based ONL install
- Start of operations and troubleshooting guide
- Various grammar/spelling fixes, dictionary expansion
Change-Id: I9b30d63a97e4443ea3871ee880646e161de8969a
diff --git a/pronto_deployment_guide/bootstrapping.rst b/pronto_deployment_guide/bootstrapping.rst
index fcdab9e..5b0d895 100644
--- a/pronto_deployment_guide/bootstrapping.rst
+++ b/pronto_deployment_guide/bootstrapping.rst
@@ -2,107 +2,179 @@
SPDX-FileCopyrightText: © 2020 Open Networking Foundation <support@opennetworking.org>
SPDX-License-Identifier: Apache-2.0
-=============
Bootstrapping
=============
.. _switch-install:
OS Installation - Switches
-==========================
+--------------------------
+
+The installation of the ONL OS image on the fabric switches uses the DHCP and
+HTTP server set up on the management server.
+
+The default image is downloaded during that installation process by the
+``onieboot`` role. Make changes to that role and rerun the management playbook
+to download a newer switch image.
+
+Preparation
+"""""""""""
+
+The switches have a single ethernet port that is shared between OpenBMC and
+ONL. Find out the MAC addresses for both of these ports and enter them into
+NetBox.
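+Once you can reach a shell on the device, one way to collect the MAC
+addresses is from ``ip link`` output. The following is a parsing sketch (not
+part of the official tooling; the sample output is abbreviated) that turns
+saved output into NetBox-ready interface/MAC pairs:

```python
import re

# Sample `ip link show` output, abbreviated to a single interface.
SAMPLE = """\
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:ec:ef:4d:55:a8 brd ff:ff:ff:ff:ff:ff
"""

def macs_from_ip_link(text):
    """Return {interface: mac} parsed from `ip link show` output."""
    pairs = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"\d+: ([^:@]+)[@:]", line)
        if m:
            current = m.group(1)
        m = re.search(r"link/ether ([0-9a-f:]{17})", line)
        if m and current:
            pairs[current] = m.group(1)
    return pairs

print(macs_from_ip_link(SAMPLE))  # {'eth0': '3c:ec:ef:4d:55:a8'}
```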
+
+Change boot mode to ONIE Rescue mode
+""""""""""""""""""""""""""""""""""""
+
+In order to reinstall an ONL image, you must change the ONIE bootloader to
+"Rescue Mode".
+
+Once the switch is powered on, it should retrieve an IP address on the OpenBMC
+interface with DHCP. OpenBMC uses these default credentials::
+
+ username: root
+ password: 0penBmc
+
+Log in to OpenBMC with SSH::
+
+ $ ssh root@10.0.0.131
+ The authenticity of host '10.0.0.131 (10.0.0.131)' can't be established.
+ ECDSA key fingerprint is SHA256:...
+ Are you sure you want to continue connecting (yes/no)? yes
+ Warning: Permanently added '10.0.0.131' (ECDSA) to the list of known hosts.
+ root@10.0.0.131's password:
+ root@bmc:~#
+
+Using the Serial-over-LAN Console, enter ONL::
+
+ root@bmc:~# /usr/local/bin/sol.sh
+ You are in SOL session.
+ Use ctrl-x to quit.
+ -----------------------
+
+ root@onl:~#
.. note::
+ If `sol.sh` is unresponsive, please try to restart the mainboard with::
- This part will be done automatically once we have a DHCP and HTTP server set up in the infrastructure.
- For now, we need to download and install the ONL image manually.
+ root@onl:~# wedge_power.sh restart
-Install ONL with Docker
------------------------
-First, enter **ONIE rescue mode**.
-Set up IP and route
-^^^^^^^^^^^^^^^^^^^
-.. code-block:: console
+Change the boot mode to rescue mode with the command ``onl-onie-boot-mode
+rescue``, and reboot::
- # ip addr add 10.92.1.81/24 dev eth0
- # ip route add default via 10.92.1.1
+ root@onl:~# onl-onie-boot-mode rescue
+ [1053033.768512] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
+ [1053033.936893] EXT4-fs (sda3): re-mounted. Opts: (null)
+ [1053033.996727] EXT4-fs (sda3): re-mounted. Opts: (null)
+ The system will boot into ONIE rescue mode at the next restart.
+ root@onl:~# reboot
-- `10.92.1.81/24` should be replaced by the actual IP and subnet of the ONL.
-- `10.92.1.1` should be replaced by the actual default gateway.
+At this point, ONL will go through its shutdown sequence and ONIE will start.
+If it does not start right away, press the Enter/Return key a few times - it
+may show you a boot selection screen. Pick ``ONIE`` and ``Rescue`` if given a
+choice.
-Download and install ONL
-^^^^^^^^^^^^^^^^^^^^^^^^
+Installing an ONL image over HTTP
+"""""""""""""""""""""""""""""""""
-.. code-block:: console
+Now that the switch is in Rescue mode, the new OS image can be installed.
- # wget https://github.com/opennetworkinglab/OpenNetworkLinux/releases/download/v1.3.2/ONL-onf-ONLPv2_ONL-OS_2020-10-09.1741-f7428f2_AMD64_INSTALLED_INSTALLER
- # sh ONL-onf-ONLPv2_ONL-OS_2020-10-09.1741-f7428f2_AMD64_INSTALLED_INSTALLER
+First, activate the Console by pressing Enter::
-The switch will reboot automatically once the installer is done.
+ discover: Rescue mode detected. Installer disabled.
-.. note::
+ Please press Enter to activate this console.
+ To check the install status inspect /var/log/onie.log.
+ Try this: tail -f /var/log/onie.log
- Alternatively, we can `scp` the ONL installer into ONIE manually.
+ ** Rescue Mode Enabled **
+ ONIE:/ #
-Setup BMC for remote console access
------------------------------------
-Log in to the BMC from ONL by
+Then run the ``onie-nos-install`` command, with the URL of the management
+server on the management network segment::
-.. code-block:: console
+ ONIE:/ # onie-nos-install http://10.0.0.129/onie-installer
+ discover: Rescue mode detected. No discover stopped.
+ ONIE: Unable to find 'Serial Number' TLV in EEPROM data.
+ Info: Fetching http://10.0.0.129/onie-installer ...
+ Connecting to 10.0.0.129 (10.0.0.129:80)
+ installer 100% |*******************************| 322M 0:00:00 ETA
+ ONIE: Executing installer: http://10.0.0.129/onie-installer
+ installer: computing checksum of original archive
+ installer: checksum is OK
+ ...
- # ssh root@192.168.0.1 # pass: 0penBmc
+The installation will now start, and then ONL will boot, culminating in::
-on `usb0` interface.
+ Open Network Linux OS ONL-wedge100bf-32qs, 2020-11-04.19:44-64100e9
-Once you are in the BMC, run the following commands to setup IP and route (or offer a fixed IP with DHCP)
+ localhost login:
-.. code-block:: console
+The default ONL login is::
- # ip addr add 10.92.1.85/24 dev eth0
- # ip route add default via 10.92.1.1
+ username: root
+ password: onl
-- `10.92.1.85/24` should be replaced by the actual IP and subnet of the BMC.
- Note that it should be different from the ONL IP.
-- `10.92.1.1` should be replaced by the actual default gateway.
+If you log in, you can verify that the switch is getting its IP address via
+DHCP::
-BMC uses the same ethernet port as ONL management so you should give it an IP address in the same subnet.
-BMC address will preserve during ONL reboot, but won’t be preserved during power outage.
+ root@localhost:~# ip addr
+ ...
+ 3: ma1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
+ link/ether 00:90:fb:5c:e1:97 brd ff:ff:ff:ff:ff:ff
+ inet 10.0.0.130/25 brd 10.0.0.255 scope global ma1
+ ...
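+If you want to script this check across several switches, Python's standard
+``ipaddress`` module can confirm the leased address sits inside the
+management subnet; the subnet below is inferred from the example output and
+may differ per site:

```python
import ipaddress

# Sanity-check sketch (not part of the deployment tooling): confirm the
# address the switch reported via `ip addr` falls inside the management
# subnet. Both values here are taken from the example output above.
switch_if = ipaddress.ip_interface("10.0.0.130/25")
mgmt_net = ipaddress.ip_network("10.0.0.128/25")
assert switch_if.ip in mgmt_net
print(f"{switch_if.ip} is in {mgmt_net}")
```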
-To log in to ONL console from BMC, run
-.. code-block:: console
+Post-ONL Configuration
+""""""""""""""""""""""
- # /usr/local/bin/sol.sh
+A ``terraform`` user must be created on the switches to allow them to be
+configured.
-If `sol.sh` is unresponsive, please try to restart the mainboard with
+This is done using Ansible. Verify that your inventory (created earlier from the
+``inventory/example-aether.ini`` file) includes an ``[aetherfabric]`` section
+that has all the names and IP addresses of the fabric switches in it.
-.. code-block:: console
+Then run a ping test::
- # wedge_power.sh restart
+ ansible -i inventory/sitename.ini -m ping aetherfabric
-Setup network and host name for ONL
------------------------------------
+This may fail with the error::
-.. code-block:: console
+ "msg": "Using a SSH password instead of a key is not possible because Host Key checking is enabled and sshpass does not support this. Please add this host's fingerprint to your known_hosts file to manage this host."
- # hostnamectl set-hostname <host-name>
+Comment out the ``ansible_ssh_pass="onl"`` line, then rerun the ping test. It
+may ask you about authorized keys - answer ``yes`` for each host to trust the
+keys::
- # vim.tiny /etc/hosts # update accordingly
- # cat /etc/hosts # example
- 127.0.0.1 localhost
- 10.92.1.81 menlo-staging-spine-1
+ The authenticity of host '10.0.0.138 (<no hostip for proxy command>)' can't be established.
+ ECDSA key fingerprint is SHA256:...
+ Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
- # vim.tiny /etc/network/interfaces.d/ma1 # update accordingly
- # cat /etc/network/interfaces.d/ma1 # example
- auto ma1
- iface ma1 inet static
- address 10.92.1.81
- netmask 255.255.255.0
- gateway 10.92.1.1
- dns-nameservers 8.8.8.8
+Once you've trusted the host keys, the ping test should succeed::
+
+ spine1.role1.site | SUCCESS => {
+ "changed": false,
+ "ping": "pong"
+ }
+ leaf1.role1.site | SUCCESS => {
+ "changed": false,
+ "ping": "pong"
+ }
+ ...
+
+Then run the playbook to create the ``terraform`` user::
+
+ ansible-playbook -i inventory/sitename.ini playbooks/aetherfabric-playbook.yml
+
+Once the playbook completes, the switch is ready for the TOST runtime install.
VPN
-===
+---
+
This section walks you through how to set up a VPN between ACE and Aether Central in GCP.
We will be using the GitOps-based Aether CD pipeline for this,
so we just need to create a patch to the **aether-pod-configs** repository.
@@ -115,7 +187,8 @@
:ref:`Add ACE to an existing VPN connection <add_ace_to_vpn>`
Before you begin
-----------------
+""""""""""""""""
+
* Make sure firewall in front of ACE allows UDP port 500, UDP port 4500, and ESP packets
from **gcpvpn1.infra.aetherproject.net(35.242.47.15)** and **gcpvpn2.infra.aetherproject.net(34.104.68.78)**
* Make sure that the external IP on ACE side is owned by or routed to the management node
@@ -146,7 +219,8 @@
+-----------------------------+----------------------------------+
Download aether-pod-configs repository
---------------------------------------
+""""""""""""""""""""""""""""""""""""""
+
.. code-block:: shell
$ cd $WORKDIR
@@ -155,7 +229,8 @@
.. _update_global_resource:
Update global resource maps
----------------------------
+"""""""""""""""""""""""""""
+
Add new ACE information at the end of the following global resource maps.
* user_map.tfvars
@@ -245,7 +320,8 @@
Create ACE specific configurations
-----------------------------------
+""""""""""""""""""""""""""""""""""
+
In this step, we will create a directory under `production` with the same name as ACE,
and add several Terraform configurations and Ansible inventory needed to configure a VPN connection.
Throughout the deployment procedure, this directory will contain all ACE specific configurations.
@@ -277,7 +353,8 @@
when using a different BOM.
Create a review request
------------------------
+"""""""""""""""""""""""
+
.. code-block:: shell
$ cd $WORKDIR/aether-pod-configs/production
@@ -302,7 +379,8 @@
CD pipeline will create VPN tunnels on both GCP and the management node.
Verify VPN connection
----------------------
+"""""""""""""""""""""
+
You can verify the VPN connections after successful post-merge job
by checking the routing table on the management node and trying to ping to one of the central cluster VMs.
Make sure two tunnel interfaces, `gcp_tunnel1` and `gcp_tunnel2`, exist
@@ -335,7 +413,8 @@
Post VPN setup
---------------
+""""""""""""""
+
Once you verify the VPN connections, please update `ansible` directory name to `_ansible` to prevent
the ansible playbook from running again.
Note that re-running the ansible playbook is harmless, but it is not recommended.
@@ -351,7 +430,8 @@
.. _add_ace_to_vpn:
Add another ACE to an existing VPN connection
----------------------------------------------
+"""""""""""""""""""""""""""""""""""""""""""""
+
VPN connections can be shared when there are multiple ACE clusters in a site.
In order to add ACE to an existing VPN connection,
you'll have to SSH into the management node and manually update BIRD configuration.
diff --git a/pronto_deployment_guide/hw_installation.rst b/pronto_deployment_guide/hw_installation.rst
index 46c30d6..e0543d2 100644
--- a/pronto_deployment_guide/hw_installation.rst
+++ b/pronto_deployment_guide/hw_installation.rst
@@ -5,15 +5,9 @@
Hardware Installation
=====================
-Hardware installation breaks down into a few steps:
-
-1. `Planning`_
-2. `Inventory`_
-3. `Rackmount of Equipment`_
-4. `Cabling and Network Topology`_
-5. `Management Switch Bootstrap`_
-6. `Management Server Bootstrap`_
-7. `Server Software Bootstrap`_
+Once the hardware has been ordered, the installation can be planned and
+implemented. This document describes the installation of the servers and
+software.
Installation of the fabric switch hardware is covered in :ref:`OS Installation
- Switches <switch-install>`.
@@ -21,38 +15,8 @@
Installation of the radio hardware is covered in :ref:`eNB Installation
<enb-installation>`.
-Planning
---------
-The planning of the network topology and devices, and required cabling
-
-Once planning is complete, equipment is ordered to match the plan.
-
-Network Cable Plan
-""""""""""""""""""
-
-If a 2x2 TOST fabric is used it should be configured as a :doc:`Single-Stage
-Leaf-Spine <trellis:supported-topology>`.
-
-- The links between each leaf and spine switch must be made up of two separate
- cables.
-
-- Each compute server is dual-homed via a separate cable to two different leaf
- switches (as in the "paired switches" diagrams).
-
-If only a single P4 switch is used, the :doc:`Simple
-<trellis:supported-topology>` topology is used, with two connections from each
-compute server to the single switch
-
-Additionally a non-fabric switch is required to provide a set of management
-networks. This management switch is configured with multiple VLANs to separate
-the management plane, fabric, and the out-of-band and lights out management
-connections on the equipment.
-
-Device Naming
-"""""""""""""
-
-Site Design and Bookkeeping
-"""""""""""""""""""""""""""
+Site Bookkeeping
+----------------
The following items need to be added to `NetBox
<https://netbox.readthedocs.io/en/stable>`_ to describe each edge site:
@@ -60,15 +24,18 @@
1. Add a Site for the edge (if one doesn't already exist), which has the
physical location and contact information for the edge.
-2. Add Racks to the Site (if they don't already exist)
+2. Add equipment Racks to the Site (if they don't already exist).
3. Add a Tenant for the edge (who owns/manages it), assigned to the ``Pronto``
or ``Aether`` Tenant Group.
-4. Add a VRF (Routing Table) for the edge site.
+4. Add a VRF (Routing Table) for the edge site. This is usually just the name
+ of the site. Make sure that ``Enforce unique space`` is checked, so that IP
+ addresses within the VRF are forced to be unique, and that the Tenant Group
+ and Tenant are set.
5. Add a VLAN Group to the edge site, which groups the site's VLANs and
- prevents duplication.
+ requires that they have a unique VLAN number.
6. Add VLANs for the edge site. These should be assigned a VLAN Group, the
Site, and Tenant.
@@ -165,6 +132,9 @@
If a specific Device Type doesn't exist for the device, it must be created,
which is detailed in the NetBox documentation, or ask the OPs team for help.
+ See `Rackmount of Equipment`_ below for guidance on how equipment should be
+ mounted in the Rack.
+
9. Add Services to the management server:
* name: ``dns``
@@ -175,8 +145,8 @@
protocol: UDP
port: 69
- These are used by the DHCP and DNS config to know which servers offer a
- dns service and tftp.
+ These are used by the DHCP and DNS config to know which servers offer
+ DNS or TFTP service.
10. Set the MAC address for the physical interfaces on the device.
@@ -252,90 +222,30 @@
TODO: Explain the cabling topology
-Hardware
-""""""""
+Rackmount of Equipment
+----------------------
-Fabric Switches
-'''''''''''''''
+Most of the Pronto equipment has a 19" rackmount form factor.
-Pronto currently uses fabric switches based on the Intel (was Barefoot) Tofino
-chipset. There are multiple variants of this switching chipset, with different
-speeds and capabilities.
+Guidelines for mounting this equipment:
-The specific hardware models in use in Pronto:
+- The EdgeCore Wedge Switches have a front-to-back (aka "port-to-power") fan
+ configuration, so hot air exhaust is out the back of the switch near the
+ power inlets, away from the 32 QSFP network ports on the front of the switch.
-* `EdgeCore Wedge100BF-32X
- <https://www.edge-core.com/productsInfo.php?cls=1&cls2=180&cls3=181&id=335>`_
- - a "Dual Pipe" chipset variant, used for the Spine switches
+- The full-depth 1U and 2U Supermicro servers also have front-to-back airflow
+ but have most of their ports on the rear of the device.
-* `EdgeCore Wedge100BF-32QS
- <https://www.edge-core.com/productsInfo.php?cls=1&cls2=180&cls3=181&id=770>`_
- - a "Quad Pipe" chipset variant, used for the Leaf switches
+- Airflow through the rack should be in one direction to avoid heat being
+ pulled from one device into another. This means that to connect the QSFP
+ network ports from the servers to the switches, cabling should be routed
+ through the rack from front (switch) to back (server). Empty rack spaces
+ should be reserved for this purpose.
-Compute Servers
-
-These servers run Kubernetes and edge applications.
-
-The requirements for these servers:
-
-* AMD64 (aka x86-64) architecture
-* Sufficient resources to run Kubernetes
-* Two 40GbE or 100GbE Ethernet connections to the fabric switches
-* One management 1GbE port
-
-The specific hardware models in use in Pronto:
-
-* `Supermicro 6019U-TRTP2
- <https://www.supermicro.com/en/products/system/1U/6019/SYS-6019U-TRTP2.cfm>`_
- 1U server
-
-* `Supermicro 6029U-TR4
- <https://www.supermicro.com/en/products/system/2U/6029/SYS-6029U-TR4.cfm>`_
- 2U server
-
-These servers are configured with:
-
-* 2x `Intel Xeon 5220R CPUs
- <https://ark.intel.com/content/www/us/en/ark/products/199354/intel-xeon-gold-5220r-processor-35-75m-cache-2-20-ghz.html>`_,
- each with 24 cores, 48 threads
-* 384GB of DDR4 Memory, made up with 12x 16GB ECC DIMMs
-* 2TB of nVME Flash Storage
-* 2x 6TB SATA Disk storage
-* 2x 40GbE ports using an XL710QDA2 NIC
-
-The 1U servers additionally have:
-
-- 2x 1GbE copper network ports
-- 2x 10GbE SFP+ network ports
-
-The 2U servers have:
-
-- 4x 1GbE copper network ports
-
-Management Server
-'''''''''''''''''
-
-One management server is required, which must have at least two 1GbE network
-ports, and runs a variety of network services to support the edge.
-
-The model used in Pronto is a `Supermicro 5019D-FTN4
-<https://www.supermicro.com/en/Aplus/system/Embedded/AS-5019D-FTN4.cfm>`_
-
-Which is configured with:
-
-* AMD Epyc 3251 CPU with 8 cores, 16 threads
-* 32GB of DDR4 memory, in 2x 16GB ECC DIMMs
-* 1TB of nVME Flash storage
-* 4x 1GbE copper network ports
-
-Management Switch
-'''''''''''''''''
-
-This switch connects the configuration interfaces and management networks on
-all the servers and switches together.
-
-In the Pronto deployment this hardware is a `HP/Aruba 2540 Series JL356A
-<https://www.arubanetworks.com/products/switches/access/2540-series/>`_.
+- The short-depth management HP Switch and 1U Supermicro servers should be
+  mounted on the rear of the rack. Neither generates an appreciable amount
+  of heat, so the airflow direction isn't a significant factor in racking
+  them.
Inventory
---------
@@ -361,30 +271,6 @@
configuration will be generated to have the OS preseed files corresponding to the
new servers based on their serial numbers.
-Rackmount of Equipment
-----------------------
-
-Most of the Pronto equipment is in a 19" rackmount form factor.
-
-Guidelines for mounting this equipment:
-
-- The EdgeCore Wedge Switches have a front-to-back (aka "port-to-power") fan
- configuration, so hot air exhaust is out the back of the switch near the
- power inlets, away from the 32 QSFP network ports on the front of the switch.
-
-- The full-depth 1U and 2U Supermicro servers also have front-to-back airflow
- but have most of their ports on the rear of the device.
-
-- Airflow through the rack should be in one direction to avoid heat being
- pulled from one device into another. This means that to connect the QSFP
- network ports from the servers to the switches, cabling should be routed
- through the rack from front (switch) to back (server).
-
-- The short-depth management HP Switch and 1U Supermicro servers should be
- mounted to the rear of the rack. They both don't generate an appreciable
- amount of heat, so the airflow direction isn't a significant factor in
- racking them.
-
Cabling and Network Topology
----------------------------
@@ -396,8 +282,8 @@
TODO: Add instructions for bootstrapping management switch, from document that
has the linked config file.
-Server Software Bootstrap
--------------------------
+Software Bootstrap
+------------------
Management Server Bootstrap
"""""""""""""""""""""""""""
@@ -440,7 +326,7 @@
source venv_onfansible/bin/activate
Obtain the ``undionly.kpxe`` iPXE artifact for bootstrapping the compute
-servers, and put it in the ``files`` directory.
+servers, and put it in the ``playbook/files`` directory.
Next, create an inventory file to access the NetBox API. An example is given
in ``inventory/example-netbox.yml`` - duplicate this file and modify it. Fill
@@ -456,10 +342,11 @@
One manual change needs to be made to this output - edit the
``inventory/host_vars/mgmtserver1.stage1.menlo.yml`` file and add the following
-to the bottom of the file, replacing the IP addresses with the ones that the
-management server is configured with on each VLAN. This configures the `netplan
-<https://netplan.io>`_ on the management server, and will be automated away
-soon::
+to the bottom of the file, replacing the IP addresses with *only the lowest
+numbered IP address* the management server has on each VLAN (if >1 IP address
+is assigned to a VLAN or Interface, the DHCP server will fail to run). This
+configures the `netplan <https://netplan.io>`_ on the management server, and
+will be automated away soon::
# added manually
netprep_netplan:
@@ -479,15 +366,15 @@
addresses:
- 10.0.1.1/25
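+The "lowest numbered IP address per VLAN" rule above can be sketched with
+Python's standard ``ipaddress`` module (the VLAN names and addresses here
+are hypothetical examples):

```python
import ipaddress

# For each VLAN interface, pick the lowest-numbered assigned address,
# per the netplan rule described above. Sample data only.
assigned = {
    "vlan2": ["10.0.0.140/25", "10.0.0.129/25"],
    "vlan3": ["10.0.1.1/25"],
}
chosen = {
    vlan: min(addrs, key=lambda a: ipaddress.ip_interface(a).ip)
    for vlan, addrs in assigned.items()
}
print(chosen)  # {'vlan2': '10.0.0.129/25', 'vlan3': '10.0.1.1/25'}
```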
-Create an inventory file for the management server in
-``inventory/menlo-staging.ini`` which contains::
-
- [mgmt]
- mgmtserver1.stage1.menlo ansible_host=<public ip address> ansible_user="onfadmin" ansible_become_password=<password>
+Using the ``inventory/example-aether.ini`` as a template, create an
+:doc:`ansible inventory <ansible:user_guide/intro_inventory>` file for the
+site. Change the device names, IP addresses, and ``onfadmin`` password to match
+the ones for this site. The management server's configuration is in the
+``[aethermgmt]`` and corresponding ``[aethermgmt:vars]`` section.
Then, to configure a management server, run::
- ansible-playbook -i inventory/menlo-staging.ini playbooks/aethermgmt-playbook.yml
+ ansible-playbook -i inventory/sitename.ini playbooks/aethermgmt-playbook.yml
This installs software with the following functionality:
@@ -496,6 +383,8 @@
- DHCP and TFTP for bootstrapping servers and switches
- DNS for host naming and identification
- HTTP server for serving files used for bootstrapping switches
+- Downloads the Tofino switch image
+- Creates user accounts for administrative access
Compute Server Bootstrap
""""""""""""""""""""""""
@@ -520,4 +409,68 @@
login: ADMIN
password: Admin123
-Once these nodes are brought up, the installation can continue.
+The BMC will also list all of the MAC addresses for the network interfaces
+(including BMC) that are built into the logic board of the system. Add-in
+network cards like the 40GbE ones used in compute servers aren't listed.
+
+To prepare the compute nodes, software must be installed on them. As they
+can't be accessed directly from your local system, a :ref:`jump host
+<ansible:use_ssh_jump_hosts>` configuration is added, so the SSH connection
+goes through the management server to the compute systems behind it. Doing this
+requires a few steps:
+
+First, configure SSH to use Agent forwarding - create or edit your
+``~/.ssh/config`` file and add the following lines::
+
+ Host <management server IP>
+ ForwardAgent yes
+
+Then try to login to the management server, then the compute node::
+
+ $ ssh onfadmin@<management server IP>
+ Welcome to Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-54-generic x86_64)
+ ...
+ onfadmin@mgmtserver1:~$ ssh onfadmin@10.0.0.138
+ Welcome to Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-54-generic x86_64)
+ ...
+ onfadmin@node2:~$
+
+Being able to login to the compute nodes from the management node means that
+SSH Agent forwarding is working correctly.
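+As an alternative to the manual two-step login, OpenSSH's ``ProxyJump``
+option can route the connection through the management server in one hop. A
+sketch of the ``~/.ssh/config`` stanza (the host aliases and addresses are
+placeholders for this site's values):

```text
Host aether-mgmt
    HostName <management server IP>
    User onfadmin
    ForwardAgent yes

Host aether-node2
    HostName 10.0.0.138
    User onfadmin
    ProxyJump aether-mgmt
```

+With this in place, ``ssh aether-node2`` reaches the compute node directly.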
+
+Verify that your inventory (created earlier from the
+``inventory/example-aether.ini`` file) includes an ``[aethercompute]`` section
+that has all the names and IP addresses of the compute nodes in it.
+
+Then run a ping test::
+
+ ansible -i inventory/sitename.ini -m ping aethercompute
+
+It may ask you about authorized keys - answer ``yes`` for each host to trust the keys::
+
+ The authenticity of host '10.0.0.138 (<no hostip for proxy command>)' can't be established.
+ ECDSA key fingerprint is SHA256:...
+ Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
+
+You should then see a success message for each host::
+
+ node1.stage1.menlo | SUCCESS => {
+ "changed": false,
+ "ping": "pong"
+ }
+ node2.stage1.menlo | SUCCESS => {
+ "changed": false,
+ "ping": "pong"
+ }
+ ...
+
+Once you've seen this, run the playbook to install the prerequisites (Terraform
+user, Docker)::
+
+ ansible-playbook -i inventory/sitename.ini playbooks/aethercompute-playbook.yml
+
+Note that Docker is quite large and may take a few minutes for installation
+depending on internet connectivity.
+
+Now that these compute nodes have been brought up, the rest of the installation
+can continue.
diff --git a/pronto_deployment_guide/overview.rst b/pronto_deployment_guide/overview.rst
index 18d289a..10f2e84 100644
--- a/pronto_deployment_guide/overview.rst
+++ b/pronto_deployment_guide/overview.rst
@@ -2,6 +2,120 @@
SPDX-FileCopyrightText: © 2020 Open Networking Foundation <support@opennetworking.org>
SPDX-License-Identifier: Apache-2.0
-========
Overview
========
+
+A Pronto deployment requires a detailed plan of the network topology,
+devices, and cabling before it can be assembled.
+
+Once planning is complete, equipment should be ordered to match the plan. The
+VAR we've used for most Pronto equipment is ASA (aka "RackLive").
+
+Network Cable Plan
+------------------
+
+If a 2x2 TOST fabric is used it should be configured as a :doc:`Single-Stage
+Leaf-Spine <trellis:supported-topology>`.
+
+- The links between each leaf and spine switch must be made up of two separate
+ cables.
+
+- Each compute server is dual-homed via a separate cable to two different leaf
+ switches (as in the "paired switches" diagrams).
+
+If only a single P4 switch is used, the :doc:`Simple
+<trellis:supported-topology>` topology is used, with two connections from each
+compute server to the single switch.
+
+Additionally, a non-fabric switch is required to provide a set of management
+networks. This management switch is configured with multiple VLANs to separate
+the management plane, fabric, and the out-of-band and lights out management
+connections on the equipment.
+
+
+Required Hardware
+-----------------
+
+Fabric Switches
+"""""""""""""""
+
+Pronto currently uses fabric switches based on the Intel (was Barefoot) Tofino
+chipset. There are multiple variants of this switching chipset, with different
+speeds and capabilities.
+
+The specific hardware models in use in Pronto:
+
+* `EdgeCore Wedge100BF-32X
+ <https://www.edge-core.com/productsInfo.php?cls=1&cls2=180&cls3=181&id=335>`_
+ - a "Dual Pipe" chipset variant, used for the Spine switches
+
+* `EdgeCore Wedge100BF-32QS
+ <https://www.edge-core.com/productsInfo.php?cls=1&cls2=180&cls3=181&id=770>`_
+ - a "Quad Pipe" chipset variant, used for the Leaf switches
+
+Compute Servers
+"""""""""""""""
+
+These servers run Kubernetes and edge applications.
+
+The requirements for these servers:
+
+* AMD64 (aka x86-64) architecture
+* Sufficient resources to run Kubernetes
+* Two 40GbE or 100GbE Ethernet connections to the fabric switches
+* One management 1GbE port
+
+The specific hardware models in use in Pronto:
+
+* `Supermicro 6019U-TRTP2
+ <https://www.supermicro.com/en/products/system/1U/6019/SYS-6019U-TRTP2.cfm>`_
+ 1U server
+
+* `Supermicro 6029U-TR4
+ <https://www.supermicro.com/en/products/system/2U/6029/SYS-6029U-TR4.cfm>`_
+ 2U server
+
+These servers are configured with:
+
+* 2x `Intel Xeon 5220R CPUs
+ <https://ark.intel.com/content/www/us/en/ark/products/199354/intel-xeon-gold-5220r-processor-35-75m-cache-2-20-ghz.html>`_,
+ each with 24 cores, 48 threads
+* 384GB of DDR4 memory, made up of 12x 16GB ECC DIMMs
+* 2TB of nVME Flash Storage
+* 2x 6TB SATA Disk storage
+* 2x 40GbE ports using an XL710QDA2 NIC
+
+The 1U servers additionally have:
+
+- 2x 1GbE copper network ports
+- 2x 10GbE SFP+ network ports
+
+The 2U servers have:
+
+- 4x 1GbE copper network ports
+
+Management Server
+"""""""""""""""""
+
+One management server is required, which must have at least two 1GbE network
+ports, and runs a variety of network services to support the edge.
+
+The model used in Pronto is a `Supermicro 5019D-FTN4
+<https://www.supermicro.com/en/Aplus/system/Embedded/AS-5019D-FTN4.cfm>`_
+
+Which is configured with:
+
+* AMD Epyc 3251 CPU with 8 cores, 16 threads
+* 32GB of DDR4 memory, in 2x 16GB ECC DIMMs
+* 1TB of nVME Flash storage
+* 4x 1GbE copper network ports
+
+Management Switch
+"""""""""""""""""
+
+This switch connects the configuration interfaces and management networks on
+all the servers and switches together.
+
+In the Pronto deployment this hardware is a `HP/Aruba 2540 Series JL356A
+<https://www.arubanetworks.com/products/switches/access/2540-series/>`_.
+
diff --git a/pronto_deployment_guide/troubleshooting.rst b/pronto_deployment_guide/troubleshooting.rst
new file mode 100644
index 0000000..7828b76
--- /dev/null
+++ b/pronto_deployment_guide/troubleshooting.rst
@@ -0,0 +1,55 @@
+..
+ SPDX-FileCopyrightText: © 2020 Open Networking Foundation <support@opennetworking.org>
+ SPDX-License-Identifier: Apache-2.0
+
+Troubleshooting
+===============
+
+Unknown MAC addresses
+---------------------
+
+Sometimes it's hard to find out all the MAC addresses assigned to network
+cards. These can be found in a variety of ways:
+
+1. On servers, the BMC webpage will list the built-in network card MAC
+ addresses.
+
+2. If you login to a server, ``ip link`` or ``ip addr`` will show the MAC
+ address of each interface, including on add-in cards.
+
+3. If you can login to a server but don't know the BMC IP or MAC address for
+ that server, you can find it with ``sudo ipmitool lan print``.
+
+4. If you don't have a login to the server, but can get to the management
+   server, ``ip neighbor`` will show the ARP table of MAC addresses known to
+   that system. Its output is unsorted - ``ip neigh | sort`` is easier to
+   read.
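+Note that plain ``sort`` orders addresses lexically, so numerically adjacent
+hosts can end up far apart; if that becomes confusing, a numeric sort can be
+sketched in Python (the neighbor lines below are made-up samples):

```python
import ipaddress

# Sort saved `ip neigh` output numerically by IP address; lexical sort
# would place 10.0.0.19 before 10.0.0.2. Sample data only.
NEIGH = """\
10.0.0.19 dev eno1 lladdr 3c:ec:ef:4d:55:a8 REACHABLE
10.0.0.2 dev eno1 lladdr 10:4f:58:e7:d5:60 STALE
"""

lines = sorted(
    (l for l in NEIGH.splitlines() if l.strip()),
    key=lambda l: ipaddress.ip_address(l.split()[0]),
)
print(lines[0])  # 10.0.0.2 dev eno1 lladdr 10:4f:58:e7:d5:60 STALE
```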
+
+Cabling issues
+--------------
+
+The system may not come up correctly if cabling isn't connected properly.
+If you don't have hands-on with the cabling, here are some ways to check on the
+cabling remotely:
+
+1. On servers you can check which ports are connected with ``ip link show``::
+
+ $ ip link show
+ ...
+ 3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
+ link/ether 3c:ec:ef:4d:55:a8 brd ff:ff:ff:ff:ff:ff
+ ...
+ 5: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
+ link/ether 3c:ec:ef:4d:55:a9 brd ff:ff:ff:ff:ff:ff
+
+ Ports that are up will show ``state UP``
+
+2. You can determine which remote ports are connected with LLDP, assuming that
+ the remote switch supports LLDP and has it enabled. This can be done with
+ ``networkctl lldp``, which shows both the name and the MAC address of the
+ connected switch on a per-link basis::
+
+ $ networkctl lldp
+ LINK CHASSIS ID SYSTEM NAME CAPS PORT ID PORT DESCRIPTION
+ eno1 10:4f:58:e7:d5:60 Aruba-2540-24…PP ..b........ 10 10
+ eno2 10:4f:58:e7:d5:60 Aruba-2540-24…PP ..b........ 1 1