..
   SPDX-FileCopyrightText: © 2020 Open Networking Foundation <support@opennetworking.org>
   SPDX-License-Identifier: Apache-2.0

Troubleshooting
===============

Firewalls and other host network issues
---------------------------------------

Unable to access a system
"""""""""""""""""""""""""

If the system is behind another system (for example, a compute node behind a
management server) and you're trying to log in to it interactively, make sure
that you've enabled SSH agent forwarding in your ``~/.ssh/config`` file::

  Host mgmtserver1.prod.site.aetherproject.net
    ForwardAgent yes

If you still have problems after verifying that this is set up, run ``ssh``
with the ``-v`` option on the second hop. It prints the connection details,
including whether the forwarded agent is used::

  onfadmin@mgmtserver1:~$ ssh -v onfadmin@node2.mgmt.prod.site.aetherproject.net
  debug1: client_input_channel_open: ctype auth-agent@openssh.com rchan 2 win 65536 max 16384
  debug1: channel 1: new [authentication agent connection]
  debug1: confirm auth-agent@openssh.com
  Welcome to Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-56-generic x86_64)
  ...
  onfadmin@node2:~$

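
Before making the second hop, it can also help to confirm that the forwarded
agent is visible on the intermediate host. The following is a minimal sketch,
assuming a POSIX shell and the standard OpenSSH ``ssh-add`` tool on that host;
the function name ``check_agent`` is illustrative and not part of any existing
tooling::

  # check_agent: report whether a forwarded SSH agent is usable in this shell.
  check_agent() {
    # sshd sets SSH_AUTH_SOCK in the session when agent forwarding is active.
    if [ -z "${SSH_AUTH_SOCK:-}" ]; then
      echo "no agent socket: forwarding is not active in this session"
      return 1
    fi
    # ssh-add -l lists the keys the agent offers; it exits non-zero when
    # there are no keys or the agent cannot be contacted.
    if ssh-add -l; then
      echo "agent reachable with keys loaded"
    else
      echo "agent socket present, but no keys could be listed"
      return 2
    fi
  }

Run ``check_agent`` on the management server after logging in. If it reports
no agent socket, recheck the ``ForwardAgent`` setting on the machine you
connected from.
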
Root/Public DNS port is blocked
"""""""""""""""""""""""""""""""

In some cases access to the public DNS root servers and other public
nameservers is blocked, which prevents DNS lookups from working within the pod.

To resolve this, provide forwarding addresses on the local network in the
Ansible YAML ``host_vars`` file, using the ``unbound_forward_zones`` list to
configure the Unbound recursive nameserver. An example::

  unbound_forward_zones:
    - name: "."
      servers:
        - "8.8.8.8"
        - "8.8.4.4"

Replace the items in the ``servers`` list with the locally accessible
nameservers.

Problems with OS installation
-----------------------------

iPXE doesn't load a Menu when started
"""""""""""""""""""""""""""""""""""""

When iPXE reports an error, it prints a URL into its documentation, which is
of high quality and explains the error in much greater detail. For example,
`https://ipxe.org/3e11623b <https://ipxe.org/3e11623b>`_ explains that the DNS
server address provided by DHCP is not functional.

The most common failures are incorrect network settings, which should be shown
when the menu loads in step 4. If the menu does not load and you get an
``iPXE>`` shell prompt, type::

  config

You should then get the iPXE configuration screen, which lists all of the
configuration parameters discovered:

.. image:: images/mgmtsrv-007.png
   :alt: iPXE config menu
   :scale: 50%

Most likely there's something wrong with the network configuration provided by
DHCP. You can scroll this menu with the arrow keys to find all the settings
provided by the DHCP server, as well as the SMBIOS information provided by the
hardware.

OS installs, but doesn't boot
"""""""""""""""""""""""""""""

If you've completed the installation but the system won't start the OS, check
these BIOS settings:

- If the startup disk is NVMe, under ``Advanced -> PCIe/PCI/PnP Configuration``
  the option ``NVMe Firmware Source`` should be set to ``AMI Native Support``,
  per `Supermicro FAQ entry 28248
  <https://www.supermicro.com/support/faqs/faq.cfm?faq=28248>`_.

Unknown MAC addresses
---------------------

Sometimes it's hard to find all the MAC addresses assigned to network cards.
They can be found in a variety of ways:

1. On servers, the BMC webpage will list the built-in network card MAC
   addresses.

2. If you log in to a server, ``ip link`` or ``ip addr`` will show the MAC
   address of each interface, including those on add-in cards.

3. If you can log in to a server but don't know the BMC IP or MAC address for
   that server, you can find it with ``sudo ipmitool lan print``.

4. If you don't have a login to the server but can get to the management
   server, ``ip neighbor`` will show the ARP table of MAC addresses known to
   that system. Its output is unsorted; ``ip neigh | sort`` is easier to
   read. This can be useful for determining whether there's a cabling
   problem: a device plugged into the wrong port of the management switch
   could show up in the DHCP pool range for a different segment.

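
When comparing the ARP table from item 4 against a list of expected MAC
addresses, it can be easier to have the MAC address in the first column. The
following is a minimal sketch, assuming ``awk`` and ``sort`` are available on
the management server; the function name ``neigh_macs`` is illustrative::

  # neigh_macs: read `ip neigh` output on stdin, print "MAC IP DEV" sorted by MAC.
  neigh_macs() {
    awk '{
      dev = ""; mac = ""
      for (i = 2; i < NF; i++) {
        if ($i == "dev")    dev = $(i + 1)
        if ($i == "lladdr") mac = $(i + 1)
      }
      # Entries without a resolved MAC (for example FAILED) are skipped.
      if (mac != "") print mac, $1, dev
    }' | LC_ALL=C sort
  }

Use it as ``ip neigh | neigh_macs``. Each output line gives the MAC address,
the IP address it was seen with, and the local interface it was seen on.
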
Cabling issues
--------------

The system may not come up correctly if cabling isn't connected properly.
If you don't have hands-on access to the cabling, here are some ways to check
it remotely:

1. On servers, you can check which ports are connected with ``ip link show``::

     $ ip link show
     ...
     3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
         link/ether 3c:ec:ef:4d:55:a8 brd ff:ff:ff:ff:ff:ff
     ...
     5: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
         link/ether 3c:ec:ef:4d:55:a9 brd ff:ff:ff:ff:ff:ff

   Ports that are up will show ``state UP``.

2. You can determine which remote ports are connected with LLDP, assuming that
   the remote switch supports LLDP and has it enabled. This can be done with
   ``networkctl lldp``, which shows both the name and the MAC address of the
   connected switch on a per-link basis::

     $ networkctl lldp
     LINK  CHASSIS ID         SYSTEM NAME       CAPS         PORT ID  PORT DESCRIPTION
     eno1  10:4f:58:e7:d5:60  Aruba-2540-24…PP  ..b........  10       10
     eno2  10:4f:58:e7:d5:60  Aruba-2540-24…PP  ..b........  1        1

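
On a server with many interfaces, the link state check from item 1 can be
condensed to one line per interface. The following is a minimal sketch,
assuming ``awk`` is available and using the one-line output mode of ``ip``
(``ip -o link show``); the function name ``link_states`` is illustrative::

  # link_states: read `ip -o link show` output on stdin, print "IFACE STATE".
  link_states() {
    awk '{
      name = $2; sub(/:$/, "", name)   # strip the trailing colon from "eno1:"
      state = "UNKNOWN"
      for (i = 3; i < NF; i++)
        if ($i == "state") state = $(i + 1)
      print name, state
    }'
  }

Use it as ``ip -o link show | link_states``. An interface that is expected to
be cabled but reports ``DOWN`` is a good first candidate to inspect.
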
Problems with ONIE Installation
-------------------------------

Can't reboot into ONL, loops on ONIE installer mode
"""""""""""""""""""""""""""""""""""""""""""""""""""

Sometimes an ONL installation is incomplete or problematic, and reinstalling it
doesn't result in a working system.

If this is the case, reboot into ONIE Rescue mode and use ``parted`` to delete
all the ``ONL-`` prefixed partitions, then reinstall with an ``onie-installer``
image.

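
To identify which partitions to delete, the partition table can be filtered by
name. The following is a minimal sketch under several assumptions: that the
``parted`` build in the rescue environment supports the machine-readable
``-m`` option, that ``awk`` is available there, and that the switch's disk is
``/dev/sda`` (check this on your hardware). The function name
``onl_partitions`` is illustrative::

  # onl_partitions: read `parted -m <disk> print` output on stdin and print
  # the numbers of partitions whose name starts with "ONL-".
  onl_partitions() {
    # Machine-readable rows look like: number:start:end:size:fs:name:flags;
    awk -F: '$1 ~ /^[0-9]+$/ && $6 ~ /^ONL-/ { print $1 }'
  }

Use it as ``parted -m /dev/sda print | onl_partitions``, review the numbers it
prints against the full ``parted /dev/sda print`` listing, and then remove each
one with ``parted /dev/sda rm N``, where ``N`` is the partition number.
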