CORD POD Test-cases
This is a rough sketch of planned test-cases, organized by area. Regard it as a wish-list: feel free to contribute to it, and use it to get ideas about where test implementation is needed.
Test-Cases
Test-cases are organized in the following categories:
- Deployment tests
- Baseline readiness tests
- Functional end-user tests
- Transient, fault, HA tests
- Scale tests
- Security tests
- Soak tests
Some test-cases may re-use other test-cases as part of more complex scenarios.
Deployment Tests
The scope and objective of these test-cases is to run the automated deployment process on a "pristine" CORD POD and to verify two things: that the system ends up in a known (verifiable) baseline state, and that the feedback from the automated deployment process is consistent with the actual outcome, i.e. no false positives or negatives (a sketch of this consistency check follows the lists below).
Positive test-cases:
- Bring-up and verify basic infrastructure assumptions (maybe just that the head-end is available for remote install)
- Execute automated deployment of CORD infrastructure and verify the baseline state. Various options need to be supported:
  - Single head-node setup (no clustering)
  - Triple-head-node setup
  - Single data-plane up-link from servers
  - Dual data-plane up-link from servers
Negative test-cases:
- Verify that deployment automation detects missing equipment
- Verify that deployment automation detects missing cable
- Verify that deployment automation detects mis-cabling of fabric and provides useful feedback to remedy the issue
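As a rough illustration of the "no false positives or negatives" criterion, the sketch below cross-checks the deployment tool's exit status against an independent probe of the baseline state. The `deploy.sh` entry point and the probes inside `check_baseline()` are hypothetical placeholders, not the actual CORD automation.

```python
# Sketch: cross-check the deployment tool's reported outcome against an
# independently verified baseline state. deploy.sh and the probe list are
# illustrative assumptions, not part of the real CORD automation.
import subprocess

def check_baseline(head_node: str) -> bool:
    """Independently probe the POD for the expected baseline state."""
    probes = [
        ["ping", "-c", "3", head_node],
        ["ssh", head_node, "systemctl", "is-active", "--quiet", "docker"],
    ]
    return all(subprocess.run(p).returncode == 0 for p in probes)

def test_deployment_feedback_consistency(head_node: str = "head1") -> None:
    deploy = subprocess.run(["./deploy.sh", "--target", head_node])
    reported_ok = deploy.returncode == 0
    actually_ok = check_baseline(head_node)
    # reported ok but baseline bad   -> false positive
    # baseline good but reported bad -> false negative
    assert reported_ok == actually_ok, (
        f"deployment reported ok={reported_ok}, baseline check={actually_ok}")
```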
Baseline Readiness Tests
- Verify API availability (XOS, ONOS, OpenStack, etc.); see the sketch after this list
- Verify software process inventory (of those processes that are covered by the baseline bring-up)
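A minimal sketch of the API-availability check, polling each northbound endpoint until it answers or a timeout expires. The hostnames, ports, and paths below are assumptions; substitute the values from the POD configuration.

```python
# Sketch: poll the northbound APIs until they answer, failing on timeout.
# All URLs are placeholders; take the real ones from the POD configuration.
import time
import requests

API_ENDPOINTS = {
    "XOS": "http://head1:9000/api/core/",          # placeholder URL
    "ONOS": "http://head1:8181/onos/v1/devices",   # placeholder URL
    "OpenStack": "http://head1:5000/v3",           # Keystone, placeholder
}

def verify_api_availability(timeout_s: int = 300) -> None:
    deadline = time.monotonic() + timeout_s
    pending = dict(API_ENDPOINTS)
    while pending and time.monotonic() < deadline:
        for name, url in list(pending.items()):
            try:
                # Any non-5xx answer (including 401) proves the API is up.
                if requests.get(url, timeout=5).status_code < 500:
                    del pending[name]
            except requests.RequestException:
                pass                    # not reachable yet; keep retrying
        time.sleep(5)
    assert not pending, f"APIs never became available: {sorted(pending)}"
```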
Functional End-User Tests
Positive test-cases:
- Verify that a new OLT can be added to the POD and it is properly initialized
- Verify that a new ONU can be added to the OLT and it becomes visible in the system
- Verify that a new RG can authenticate and get admitted to the system (receives an IP address)
- Verify that the RG can access the Intranet and the Internet
- Verify that the RG receives periodic IGMP XXX messages
- Verify that the RG can join a multicast channel and starts receiving the bridged flow
- Verify that the RG, after joining, starts receiving the multicast flow within a tolerance interval (see the sketch after this list)
- Verify that the RG can join multiple multicast streams simultaneously
- Verify that the RG receives periodic IGMP reports
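The join-and-receive cases above could be driven by something like the following sketch, which emulates the RG side: opening a UDP socket and joining the group makes the kernel emit the IGMP membership report, and the test fails if no traffic arrives within the tolerance interval. The group address, port, and 2-second tolerance are illustrative assumptions.

```python
# Sketch: join a multicast group as an emulated RG and measure how long it
# takes for the first packet of the feed to arrive. Group, port, and the
# tolerance value are placeholders for the real channel plan.
import socket
import struct
import time

def join_and_measure(group: str = "229.1.1.1", port: int = 5000,
                     tolerance_s: float = 2.0) -> float:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Joining the group triggers the kernel's IGMP membership report.
    mreq = struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(tolerance_s)
    start = time.monotonic()
    try:
        sock.recv(2048)                # first packet of the multicast feed
    except socket.timeout:
        raise AssertionError(f"no traffic on {group} within {tolerance_s}s")
    finally:
        sock.close()
    return time.monotonic() - start    # channel-change latency sample
```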
Complex test-cases:
- Measure channel surfing experience
- Replacing RG for existing subscriber
- Moving existing subscriber to a new address (same RG, new location)
Negative test-cases:
- Verify that a subscriber that is not registered cannot join the network
- Verify that a subscriber RG cannot be added unless it is on the prescribed port (OLT/ONU port?)
- Verify that a subscriber that has no Internet access cannot reach the Internet
- Verify that a subscriber with limited channel access cannot subscribe to disabled/prohibited channels
- Verify that a subscriber identity cannot be re-used at a different RG (no two RGs with the same certificate can ever be logged into the system)
Transient, fault, HA Tests
In this block, test-cases should cover the following scenarios:
Hardware disruption and power-cycling scenarios:
In the following scenarios, in non-HA setups the system shall at least recover after the hardware component is restored; in HA setups, the system shall ride through these scenarios without service interruption. (A power-cycling sketch follows the lists below.)
- Power cycling OLT
- Power cycling ONU
- Re-starting RG
- Power cycling any server (one at a time)
- Power cycling a fabric switch
- Power cycling any of the VMs
- Power cycling management switch
- Replacing a server-to-leaf cable
- Replacing a leaf-to-spine cable
In HA scenarios, the following shall result in only degraded service, but not loss of service:
- Powering off a server (and keeping it powered off)
- Powering off a spine fabric switch
- Powering off a leaf fabric switch
- Removing a server-to-leaf cable (emulating DAC failure)
- Removing a leaf-to-spine cable (emulating DAC failure)
- Powering off management switch
- Powering each of the above back on
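A hedged sketch of how a single power-cycle scenario could be exercised, assuming the component's BMC is reachable with `ipmitool`. The BMC credentials and the `service_probe()` health check are placeholders for the POD's own probes.

```python
# Sketch: power-cycle one component via its BMC, then verify that an HA POD
# never drops service and a non-HA POD recovers within a budget. Credentials
# and the probe below are illustrative placeholders.
import subprocess
import time

def power_cycle(bmc_host: str, user: str = "admin",
                password: str = "admin") -> None:
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host,
         "-U", user, "-P", password, "chassis", "power", "cycle"],
        check=True)

def service_probe() -> bool:
    """Placeholder: return True if the end-user service is healthy."""
    return subprocess.run(["ping", "-c", "1", "8.8.8.8"]).returncode == 0

def test_power_cycle(bmc_host: str, ha: bool,
                     recover_timeout_s: int = 600) -> None:
    power_cycle(bmc_host)
    if ha:
        # HA POD: service must stay up throughout the cycle.
        for _ in range(recover_timeout_s // 10):
            assert service_probe(), "service interrupted in HA setup"
            time.sleep(10)
    else:
        # Non-HA POD: service must return once the component is back.
        deadline = time.monotonic() + recover_timeout_s
        while time.monotonic() < deadline:
            if service_probe():
                return
            time.sleep(10)
        raise AssertionError("service did not recover after power cycle")
```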
Process cycling scenarios:
- Restarting any of the processes
- Killing any of the processes (the system shall recover via auto-restart; see the sketch after this list)
- Killing and restoring containers
- Relocation scenarios [TBD]
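A minimal sketch of the kill-and-auto-restart check: kill a supervised process over SSH and verify that it reappears with a new PID within a restart budget. The process name and the 60-second budget are assumptions.

```python
# Sketch: kill a process and verify its supervisor restarts it (the PID
# must change). Host, process name, and budget are placeholders.
import subprocess
import time

def pid_of(host: str, name: str):
    out = subprocess.run(["ssh", host, "pgrep", "-o", "-x", name],
                         capture_output=True, text=True)
    return int(out.stdout.split()[0]) if out.returncode == 0 else None

def test_kill_and_autorestart(host: str, name: str,
                              budget_s: int = 60) -> None:
    old_pid = pid_of(host, name)
    assert old_pid is not None, f"{name} not running on {host}"
    subprocess.run(["ssh", host, "kill", "-9", str(old_pid)], check=True)
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        new_pid = pid_of(host, name)
        if new_pid is not None and new_pid != old_pid:
            return                     # supervisor restarted the process
        time.sleep(2)
    raise AssertionError(f"{name} not auto-restarted within {budget_s}s")
```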
Additive scenarios:
- Add a new spine switch to the system
- Add a new compute server to the system
- Add a new head node to the system
Scale Tests
Input dimensions of the test load to scale against:
- Number of subscribers
- Number of routes pushed to CORD POD
- Number of NBI API sessions
- Number of NBI API requests
- Subscriber channel change rate
- Subscriber aggregate traffic load to Internet
In addition to verifying healthy operation, the following list contains what needs to be measured quantitatively, as a function of the input load (see the sampling sketch after this list):
- CPU utilization per each server
- Disk utilization per each server
- Network utilization at various capture points (fabric ports to start with)
- Channel change "response time" (how long it takes to start receiving the bridged traffic as well as the real multicast feed)
- Internet access round-trip time
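One way to capture these measurements is a periodic sampler running on each server, tagging every sample with the current input load so the results can be plotted as a function of load. The sketch below assumes `psutil` is available on the measured servers; the metric set and interval are illustrative.

```python
# Sketch: sample CPU, disk, and network counters at a fixed interval while
# a scale-test load runs, recording the input load level with each sample.
# Requires psutil; metric choice and interval are illustrative.
import csv
import time
import psutil

def sample_utilization(load_level: int, duration_s: int,
                       interval_s: int = 10, out: str = "scale.csv") -> None:
    with open(out, "a", newline="") as f:
        writer = csv.writer(f)
        end = time.monotonic() + duration_s
        while time.monotonic() < end:
            writer.writerow([
                time.time(),
                load_level,                           # e.g. subscriber count
                psutil.cpu_percent(interval=1),       # % CPU over a 1 s window
                psutil.disk_usage("/").percent,       # root FS utilization
                psutil.net_io_counters().bytes_sent,  # cumulative TX bytes
            ])
            time.sleep(interval_s)
```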
Security Tests
The purpose of these tests is to detect vulnerabilities across the various surfaces of CORD, including:
- PON ports (via ONU ports)
- NBI APIs
- Internet up-link
- CORD POD-local penetration tests:
  - Via patch cable into the management switch
  - Via fabric ports
  - Via unused NIC ports of server(s)
  - Via local console (only if secure boot is enabled)
Tests shall include:
- Port scans on the management network: only a pre-defined list of ports shall be open (see the sketch below)
- Local clustering shall be VLAN-isolated from the management network
- Qualys free scan
- SSH vulnerability scans
- SSL certificate validation
[TBD: define more specific test scenarios]
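As one concrete starting point for the port-scan criterion, the sketch below does a lightweight TCP sweep against an allow-list. The allow-list shown is an example only, and a real scan would also cover UDP and the full port range; derive the actual list from the POD's security policy.

```python
# Sketch: TCP sweep of a management-network host, asserting that only an
# allow-list of ports answers. The list below is an example, not policy.
import socket

ALLOWED_PORTS = {22, 8181, 9000}       # example: SSH, ONOS, XOS

def open_ports(host: str, ports=range(1, 1025), timeout_s: float = 0.5):
    found = set()
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout_s)
            if s.connect_ex((host, port)) == 0:
                found.add(port)
    return found

def test_management_port_exposure(host: str = "head1") -> None:
    unexpected = open_ports(host) - ALLOWED_PORTS
    assert not unexpected, \
        f"unexpected open ports on {host}: {sorted(unexpected)}"
```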
In addition, proprietary scans, such as the Nessus Vulnerability Scan, will be performed prior to major releases by the commercial CORD vendor Ciena.
Soak Tests
This is really one comprehensive multi-faceted test run on the POD, involving the following steps:
Preparation phase:
- Deploy system using the automated deployment process
- Verify baseline acceptance
- Admit a preset number of RGs
- Subscribe to a pre-configured set of multicast feeds
- Start a nominal Internet access load pattern on each RG
- Optionally (per test config): start background scaled-up load (dpdk-pktgen based)
- Capture baseline resource usage (memory, disk utilization per server, per vital process)
Soak phase (sustained for a preset time period: 8h, 24h, 72h, etc.):
- Periodically monitor health of ongoing sessions (emulated RGs happy?)
- Periodically test presence of all processes
- Check for stable process IDs (a changing PID can be a sign of a restarted process)
- Periodically capture resource usage (see the sketch after this list), including:
  - process memory use
  - file descriptors
  - disk space
  - disk I/O
  - flow table entries in software and fabric switches
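A sketch of one soak-phase monitoring pass, combining the process-presence, PID-stability, and resource-usage checks above. The process names and growth thresholds are placeholders, and `psutil` is assumed to be available.

```python
# Sketch: one monitoring pass of the soak phase. Process names and the
# growth threshold are placeholders; psutil is assumed installed.
import psutil

EXPECTED = ["onos", "ovs-vswitchd"]    # placeholder process names

def snapshot() -> dict:
    """Collect PID, memory, and fd counts for each expected process."""
    procs = {}
    for p in psutil.process_iter(["name", "pid"]):
        if p.info["name"] not in EXPECTED:
            continue
        try:
            procs[p.info["name"]] = {
                "pid": p.info["pid"],
                "rss": p.memory_info().rss,    # resident memory, bytes
                "fds": p.num_fds(),            # open file descriptors
            }
        except psutil.NoSuchProcess:
            continue                           # process exited mid-scan
    return procs

def soak_check(baseline: dict) -> None:
    now = snapshot()
    for name in EXPECTED:
        assert name in now, f"process {name} disappeared"
        assert now[name]["pid"] == baseline[name]["pid"], \
            f"{name} PID changed: possible restart"
        # The 2x growth threshold is illustrative, not a product limit.
        assert now[name]["rss"] < 2 * baseline[name]["rss"], \
            f"{name} memory use doubled since baseline"
```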
Final check:
- Final capture of resource utilization and health report
Baseline Acceptance Criteria
The baseline acceptance is based on a list of criteria, including:
On all servers involved in the POD:
- Verify BIOS settings (indirectly)
- Verify kernel boot options
- Verify OS version
- Verify kernel driver options for NICs (latest driver)
- Verify kernel settings
- Verify software inventory (presence and versions) of the following, as applicable (see the sketch after this list):
  - DPDK version
  - OVS version
  - etc.
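A minimal sketch of the software-inventory check for a Debian/Ubuntu-based server, comparing installed package versions against expected prefixes. The package names and versions are placeholders; in practice they would come from the release manifest.

```python
# Sketch: verify package presence and version on a remote server via
# dpkg-query. Expected entries are placeholders for the release manifest.
import subprocess

EXPECTED = {
    "openvswitch-switch": "2.5",   # placeholder OVS version prefix
    "dpdk": "16.07",               # placeholder DPDK version prefix
}

def installed_version(host: str, pkg: str) -> str:
    # Single-quote ${Version} so the remote shell hands it to dpkg-query.
    cmd = f"dpkg-query -W -f='${{Version}}' {pkg}"
    out = subprocess.run(["ssh", host, cmd],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def test_software_inventory(host: str = "compute1") -> None:
    for pkg, want in EXPECTED.items():
        got = installed_version(host, pkg)
        assert got.startswith(want), f"{pkg} on {host}: {got}, want {want}*"
```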