..
   SPDX-FileCopyrightText: © 2020 Open Networking Foundation <support@opennetworking.org>
   SPDX-License-Identifier: Apache-2.0
   Commit: a91cbd5 (Scott Baker, 2021-07-28 09:23:08 -0700)

Runtime Operational Control (ROC)
=================================

Purpose
-------

The Aether Runtime Operational Control (ROC) is a component whose primary purpose is to manage the
Aether Connectivity Service (ACS), including facilitating the integration of edge services with the ACS.
The Aether ROC allows enterprises to configure subscribers and profiles, as well as implement policies related
to those profiles. It also allows the Aether operations team to configure the parameters of those policies.
The ROC is one of many subsystems that make up the Aether Management Platform (AMP).

What the ROC *does* do:

- Add/Update/Delete/Query configuration

- Persist configuration

- Push configuration to services and devices

- Make observations actionable, either manually or automatically

What the ROC *does not* do:

- The ROC does not directly deploy or manage the lifecycle of containers.
  This is done using the Terraform/Rancher/Helm/Kubernetes stack.

- The ROC does not directly collect or store logging or metric information.
  This is done using the Elastic Stack and Grafana/Prometheus components.

- The ROC is not a message bus used for component-to-component communication.
  If a message bus is required, then a suitable service such as Kafka could be used.

- The ROC does not implement a service dependency graph.
  This can be done through Helm charts, which are typically hierarchical in nature.

- The ROC is not a formal service mesh.
  Other tools, such as Istio, could be leveraged to provide service meshes.

- The ROC does not configure *Edge Services*.
  While the ROC’s modeling support is general and could be leveraged to support an edge service, and an
  adapter could be written to configure an edge service, promoting an edge service to ROC management would
  be the exception rather than the rule. Edge services have their own GUIs and APIs, perhaps belonging to
  a third-party service provider.

Although we call out the tasks that the ROC doesn’t do itself, it’s often still necessary for the ROC to be aware
of the actions these other components have taken.
For example, while the ROC doesn’t implement a service dependency graph, the ROC is aware
of how services are related. This is necessary because some of the actions it takes affect multiple services
(e.g., a ROC-supported operation on a subscriber profile might result in the ROC making calls to SD-Core,
SD-RAN, and SD-Fabric).

Throughout the design process, the ROC design team has taken lessons learned from prior systems, such as XOS,
and applied them to create a next-generation design that solves the configuration problem in a
focused and lightweight manner.

Design and Requirements
-----------------------

- The ROC must offer an *API* that may be used by administrators, as well as external services, to configure
  Aether.

- This ROC API must support new end-to-end abstractions that cross multiple subsystems of Aether.
  For example, “give subscriber X running application Y QoS guarantee Z” is an abstraction that potentially
  spans SD-RAN and SD-Fabric.
  The ROC defines and implements such end-to-end abstractions.

- The ROC must offer an *Operations GUI* to Operations Personnel, so they may configure the Aether Connectivity
  Service.

- The ROC must offer an *Enterprise GUI* to Enterprise Personnel, so they may configure the connectivity aspects
  of their particular edge site.
  It’s possible this GUI shares implementation with the Operations GUI, but the presentation, content, and
  workflow may differ.

- The ROC must support *versioning* of configuration, so changes can be rolled back as necessary, and an audit
  history of previous configurations may be retrieved.

- The ROC must support best practices of *performance*, *high availability*, *reliability*, and *security*.

- The ROC must support *role-based access controls (RBAC)*, so that different parties have different visibility
  into the data model.

- The ROC must be extensible.
  Aether will incorporate new services over time, and existing services will evolve.
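
The versioning requirement in the list above (rollback plus an audit history) can be illustrated with a small sketch.
This is not how the ROC implements versioning; it is a minimal in-memory model, and the class name
``VersionedConfig`` and the configuration key shown are hypothetical.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class VersionedConfig:
    """Keeps every committed snapshot so changes can be audited or rolled back."""

    # history[i] is the full configuration as of version i; version 0 is empty.
    history: list = field(default_factory=lambda: [{}])

    @property
    def current(self) -> dict:
        return deepcopy(self.history[-1])

    def commit(self, changes: dict) -> int:
        """Apply changes on top of the latest snapshot; a value of None deletes a key."""
        snapshot = self.current
        for key, value in changes.items():
            if value is None:
                snapshot.pop(key, None)
            else:
                snapshot[key] = value
        self.history.append(snapshot)
        return len(self.history) - 1

    def rollback(self, version: int) -> int:
        """Roll back by committing a copy of an earlier snapshot, so the audit trail is kept."""
        if not 0 <= version < len(self.history):
            raise ValueError(f"unknown version {version}")
        self.history.append(deepcopy(self.history[version]))
        return len(self.history) - 1


# Hypothetical key, for illustration only.
store = VersionedConfig()
v1 = store.commit({"device-group/phones/mbr-uplink": 10_000_000})
v2 = store.commit({"device-group/phones/mbr-uplink": 50_000_000})
v3 = store.rollback(v1)
```

Recording a rollback as a new version, rather than truncating the history, is one way to satisfy both
requirements at once: the earlier state is restored and the record of what happened stays complete.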

Data Model
----------

An important aspect of the ROC is that it maintains a data model that represents all the abstractions it is
responsible for, such as subscribers and profiles.
The ROC’s data model is based on YANG specifications.
YANG is a rich language for data modeling, with support for strong validation of the data stored in the models.
YANG allows relations between objects to be specified, adding a relational aspect that our previous approaches
(for example, protobuf) did not directly support.
YANG is agnostic as to how the data is stored, and is not directly tied to SQL/RDBMS or NoSQL paradigms.
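
To make the two YANG properties mentioned above more concrete (validation of stored data, and relations between
objects), here is a rough Python stand-in. It is not YANG and does not reflect the actual Aether models; the
``Profile`` and ``Subscriber`` classes, their fields, and the bitrate range are all assumptions chosen for
illustration.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    id: str
    max_bitrate_mbps: int  # hypothetical field with a range constraint


@dataclass
class Subscriber:
    id: str
    profile: str  # refers to Profile.id, similar in spirit to a YANG leafref


def validate(profiles: list[Profile], subscribers: list[Subscriber]) -> list[str]:
    """Return a list of human-readable validation errors (empty when valid)."""
    errors = []
    known = {p.id for p in profiles}
    for p in profiles:
        # Range check, comparable to a YANG "range" restriction on an integer leaf.
        if not 1 <= p.max_bitrate_mbps <= 10_000:
            errors.append(f"profile {p.id}: max_bitrate_mbps out of range")
    for s in subscribers:
        # Referential check: every subscriber must point at an existing profile.
        if s.profile not in known:
            errors.append(f"subscriber {s.id}: unknown profile {s.profile!r}")
    return errors


profiles = [Profile("gold", 500)]
ok = validate(profiles, [Subscriber("ue-1", "gold")])
bad = validate(profiles, [Subscriber("ue-2", "silver")])
```

In YANG the same constraints would be declared in the model itself rather than written as procedural checks,
which is the main benefit the text above describes.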

The ROC uses tooling built around aether-config (an ONOS-based microservice) to maintain a set of YANG models.
Among other things, aether-config implements model versioning.
Migration from one version of the data model to another is supported, as is simultaneous operation of
different versions.

Architecture
------------

Below is a high-level architectural diagram of the ROC:

.. image:: images/aether-architecture.svg

The following walks through the main stack of ROC components in a top-down manner, starting with the GUI(s) and
ending with the devices/services.

Operations Portal / Enterprise Portal
"""""""""""""""""""""""""""""""""""""

The code base for the Operations Portal and Enterprise Portal is shared.
They are two different perspectives of the same portal.
The *Operations Portal* presents a rougher, more expansive view of the breadth of the Aether modeling.
The *Enterprise Portal* presents a more curated view of the modeling.
These different perspectives can be enforced through the following:

- RBAC controls, to limit access to information that might be unsuitable for a particular party.

- Dashboards, to aggregate/present information in an intuitive manner.

- Multi-step workflows (aka Wizards), to break a complex task into smaller guided steps.

The *Portal* is an Angular-based TypeScript GUI.
The GUI uses a REST API to communicate with the aether-roc-api layer, which in turn communicates with aether-config
via gNMI.
The GUI implementation is consistent with modern GUI design: it is implemented as a single-page application and
includes a “commit list” that allows several changes to be atomically submitted together.
Views within the GUI are handcrafted, and as new models are added to Aether, the GUI must be adapted to incorporate
the new models.
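
The “commit list” idea described above, where several changes are submitted together atomically, can be sketched as
follows. The Portal itself is written in TypeScript; this Python sketch only illustrates the all-or-nothing
behavior, and the ``CommitList`` class, the paths, and the validator are hypothetical.

```python
class CommitList:
    """Collects pending changes and applies them all-or-nothing."""

    def __init__(self, config: dict):
        self._config = config
        self._pending = []  # list of (path, value) pairs

    def stage(self, path: str, value) -> None:
        self._pending.append((path, value))

    def submit(self, validator) -> bool:
        """Apply every staged change, or none if any change fails validation."""
        candidate = dict(self._config)
        for path, value in self._pending:
            if not validator(path, value):
                return False  # leave both the config and the pending list untouched
            candidate[path] = value
        # Every change passed, so swap the new contents in as a single step.
        self._config.clear()
        self._config.update(candidate)
        self._pending.clear()
        return True


config = {"site/a/name": "Site A"}
commits = CommitList(config)
commits.stage("site/a/name", "HQ")
commits.stage("site/a/mcc", "")  # invalid: empty value
rejected = commits.submit(lambda path, value: value != "")
```

Because the second staged change is invalid, the first one is not applied either, and ``config`` keeps its
original contents.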

The Portal is a combination of control and observation.
The control aspect relates to pushing configuration, and the observation aspect relates to viewing metrics,
logging, and alerts.
The Portal will leverage other components to do some of the heavy lifting.
For example, it would make no sense for us to implement our own graph-drawing tool or our own metrics querying
language when Grafana and Prometheus already provide both.
GUI pages can be constructed that embed the Grafana renderer.

aether-roc-api
""""""""""""""

Aether-roc-api is a REST API layer that sits between the portals and aether-config.
The southbound layer of aether-roc-api is gNMI.
This is how aether-roc-api talks to aether-config.
Aether-roc-api at this time is entirely auto-generated; developers need not spend time manually creating REST APIs
for their models.
The API layer serves multiple purposes:

- gNMI is an inconvenient interface to use for GUI design, and REST is expected for GUI development.

- The API layer is a potential location for early validation and early security checking, allowing errors to be caught
  closer to the user.
  This allows error messages to be generated in a more customary way than with gNMI.

- The API layer is another place where semantic translation can take place.
  Although the API layer is currently auto-generated, it is possible that additional methods could be added.
  gNMI supports only “GET” and “SET”, whereas the aether-roc-api natively supports “GET”, “PUT”, “POST”, “PATCH”,
  and “DELETE”.
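
The difference in verbs noted in the list above suggests a translation step between REST and gNMI.
The sketch below shows one plausible mapping onto the ``delete``, ``replace``, and ``update`` fields of a gNMI
``SetRequest``. It is an assumption for illustration only: the function name ``rest_to_gnmi``, the dictionary
layout, and the choice of which verb maps to which field are not taken from the aether-roc-api implementation.

```python
def rest_to_gnmi(method: str, path: str, body=None) -> dict:
    """Describe the gNMI RPC that a REST call could translate into."""
    method = method.upper()
    if method == "GET":
        return {"rpc": "Get", "path": path}
    if method == "DELETE":
        return {"rpc": "Set", "delete": [path]}
    if method == "PUT":
        # PUT replaces the whole subtree at the path.
        return {"rpc": "Set", "replace": [{"path": path, "val": body}]}
    if method in ("POST", "PATCH"):
        # POST and PATCH merge the given values into the subtree.
        return {"rpc": "Set", "update": [{"path": path, "val": body}]}
    raise ValueError(f"unsupported method: {method}")


# Hypothetical path and payload.
request = rest_to_gnmi("PATCH", "/site/a", {"description": "Main campus"})
```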

aether-config
"""""""""""""

*Aether-config* (an Aether-specific deployment of the “\ *onos-config*\ ” microservice) is the core of the ROC’s
configuration system.
Aether-config is a component that other teams may use in other contexts.
It’s possible that an Aether deployment might have multiple instances of aether-config used for independent purposes.
The job of aether-config is to store and version configuration data.
Configuration is pushed to aether-config through the northbound gNMI interface, is stored in an Atomix database
(not shown in the figure), and is pushed to services and devices using a southbound gNMI interface.

Adapters
""""""""

Not every device or service beneath the ROC supports gNMI, and in the case where it is not supported, an adapter is
written to translate between gNMI and the device’s or service’s native API.
For example, a gNMI → REST adapter exists to translate between the ROC’s modeling and the Aether Connectivity
Control (SD-Core) components. The adapter is not necessarily only a syntactic translation, but may also be a
semantic translation. [1]_
This supports a logical decoupling of the models stored in the ROC and the interface used by the southbound
device/service, allowing the southbound device/service and the ROC to evolve independently.
It also allows for southbound devices/services to be replaced without affecting the northbound interface.
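
The distinction between syntactic and semantic translation can be shown with a toy adapter.
Everything specific here is hypothetical: the path format, the ``uplink-mbps`` and ``uplink_bps`` field names,
the unit conversion, and the URL. The sketch does not describe the real SD-Core adapter or its REST API.

```python
def gnmi_update_to_rest(path: str, value: dict, base_url: str) -> dict:
    """Translate one gNMI-style update into a description of a REST request."""
    # Syntactic step: '/device-group[id=phones]' -> kind 'device-group', key 'phones'.
    kind, _, rest = path.strip("/").partition("[id=")
    key = rest.rstrip("]")
    # Semantic step: the (hypothetical) model stores Mbps, the service expects bps.
    payload = {"name": key, "uplink_bps": value["uplink-mbps"] * 1_000_000}
    return {"method": "PUT", "url": f"{base_url}/{kind}/{key}", "json": payload}


request = gnmi_update_to_rest(
    "/device-group[id=phones]", {"uplink-mbps": 10}, "http://sd-core.example"
)
```

Keeping both steps inside the adapter is what lets the model and the southbound API change independently: a
change to the service’s units or URL layout touches only this function.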

Workflow Engine
"""""""""""""""

The workflow engine, to the left of the aether-config stack, is where multi-step workflows may be implemented.
At this time we do not have these workflows, but during the experience with SEBA/VOLTHA, we learned that workflows
became a key aspect of the implementation.
For example, SEBA had a state machine surrounding how devices were authorized, activated, and deactivated.
The workflow engine is a placeholder where workflows may be implemented in Aether as they are required.
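
A state machine of the kind mentioned above (devices being authorized, activated, and deactivated) might look
roughly like the following. The states, events, and allowed transitions are assumptions made for this sketch and
are not taken from SEBA or from any Aether workflow.

```python
from enum import Enum


class DeviceState(Enum):
    DISCOVERED = "discovered"
    AUTHORIZED = "authorized"
    ACTIVATED = "activated"
    DEACTIVATED = "deactivated"


# Allowed transitions: event -> {current state: next state}
TRANSITIONS = {
    "authorize": {DeviceState.DISCOVERED: DeviceState.AUTHORIZED},
    "activate": {
        DeviceState.AUTHORIZED: DeviceState.ACTIVATED,
        DeviceState.DEACTIVATED: DeviceState.ACTIVATED,
    },
    "deactivate": {DeviceState.ACTIVATED: DeviceState.DEACTIVATED},
}


def step(state: DeviceState, event: str) -> DeviceState:
    """Return the next state, or raise if the event is not valid in this state."""
    try:
        return TRANSITIONS[event][state]
    except KeyError:
        raise ValueError(f"cannot {event!r} a device in state {state.value!r}") from None


state = DeviceState.DISCOVERED
for event in ("authorize", "activate", "deactivate", "activate"):
    state = step(state, event)
```

Keeping the transitions in a table makes it easy to reject out-of-order events, such as activating a device that
was never authorized.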

Another use of the workflow engine may be to translate between levels in modeling.
For example, the workflow engine may examine the high-level Enterprise modeling and make changes to the Operations
modeling to achieve the Enterprise behavior.

Previously this component was referred to as “onos-ztp”.
It is expected that a workflow engine would both read and write the aether-config data model, as well as respond to
external events.

Analytics Engine
""""""""""""""""

The analytics engine, to the right of the aether-config stack, is where enrichment of analytics will be performed.
Raw metrics and logs are collected with the open source components Grafana/Prometheus and Elastic Stack.
Those metrics might need additional transformation before they can be presented to Enterprise users, or in some
cases even before they are presented to the Ops team.
The analytics engine would be a place where those metrics could be transformed or enriched, and then written back
to Prometheus or Elastic (or forwarded as alerts).

The analytics engine is also where analytics would be related to config models in aether-config, in order for
Enterprise or Operations personnel to take action in response to data and insights received through analytics.
Action doesn’t necessarily have to involve humans.
It is expected that the combination of analytics engine and workflow engine could automate a response.

The analytics engine also provides an opportunity to implement access control for the telemetry API.
Prometheus itself is not multi-tenant and does not support fine-grained access controls.
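
Two of the roles described above, enriching raw metrics with information from the configuration models and
restricting what each tenant can see, can be sketched in a few lines. The sample layout, the device-to-enterprise
mapping, and the function names are hypothetical, and no Prometheus or Elastic API is involved.

```python
def enrich(samples: list[dict], device_to_enterprise: dict) -> list[dict]:
    """Attach the owning enterprise (taken from configuration) to each raw sample."""
    return [
        {**s, "enterprise": device_to_enterprise.get(s["device"], "unknown")}
        for s in samples
    ]


def visible_to(samples: list[dict], enterprise: str) -> list[dict]:
    """Keep only the samples that a given enterprise is allowed to see."""
    return [s for s in samples if s["enterprise"] == enterprise]


# Hypothetical raw samples and ownership data.
raw = [{"device": "ue-1", "rx_bytes": 100}, {"device": "ue-2", "rx_bytes": 250}]
owners = {"ue-1": "acme", "ue-2": "globex"}
acme_view = visible_to(enrich(raw, owners), "acme")
```

The enrichment step is what makes the filter possible: the raw samples carry no tenant information until they are
joined with the configuration data.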

Aether Operator
"""""""""""""""

Not pictured in the diagram is the ONOS Operator, which is responsible for configuring the models within
aether-config. Models to load are specified by a Helm chart.
The operator compiles them on demand and incorporates them into aether-config.
This eliminates dynamic-load compatibility issues that were previously a problem with building models and
aether-config separately. Operators are considered a best practice in Kubernetes.

Modules are loaded into the process primarily for performance and simplicity reasons.
The design team has had experience with other systems (for example, VOLTHA and XOS) where modules were decoupled
and message buses introduced between them, but that can lead to both complexity issues and performance bottlenecks
in those systems. The same module and operator pattern will be applied to aether-roc-api.

Aether Modeling
---------------

There is no fixed distinction between high-level and low-level modeling in the ROC.
There is one set of Aether modeling that might have customer-facing and internal-facing aspects.

.. image:: images/aether-highlevel.svg

The above diagram is an example of how a single set of models could serve both high-level and low-level needs and
is not necessarily identical to the current implementation.
For example, *App* and *Service* are concepts that are necessarily enterprise-facing.
*UPF*\ s are concepts that are operator-facing.
A UPF might be used by a Service, but the customer need not be aware of this detail.
Similarly, some objects might be partially customer-facing and partially operator-facing.
For example, a *Radio* is a piece of hardware the customer has deployed on their premises, so they must know of it,
but the configuration details of the radio (signal strength, IP address, etc.) are operator-facing.

An approximation of the current Aether-3.0 (Release 1.5) modeling is presented below:

.. image:: images/aether-3.0-models.svg

The key Enterprise-facing abstractions are *Application*, *Virtual Cellular Service* (VCS), and *DeviceGroup*.

Identity Management
-------------------

The ROC leverages an external identity database (i.e., an LDAP server) to store user data such as account names and
passwords for users who are able to log in to the ROC.
This LDAP server also has the capability to associate users with groups; for example, adding ROC administrators to
ONFAetherAdmin would be a way to grant those people administrative privileges within the ROC.

An external authentication service (Dex) is used to authenticate the user, handling the mechanics of accepting the
password, validating it, and securely returning the group the user belongs to.
The group identifier is then used to grant access to resources within the ROC.

The ROC leverages Open Policy Agent (OPA) as a framework for writing access control policies.
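
The step in which a group identifier is used to grant access can be illustrated as below.
OPA policies are normally written in OPA’s own policy language, so this Python function is only a stand-in for the
decision logic. The ``POLICY`` table, the ``acme-users`` group, and the resource paths are hypothetical; only the
ONFAetherAdmin group name comes from the text above.

```python
# Hypothetical policy table: group name -> set of allowed (action, resource-prefix) pairs.
POLICY = {
    "ONFAetherAdmin": {("read", "/"), ("write", "/")},
    "acme-users": {("read", "/enterprise/acme")},
}


def is_allowed(groups: list[str], action: str, resource: str) -> bool:
    """Allow the request if any of the user's groups grants the action on the resource."""
    for group in groups:
        for allowed_action, prefix in POLICY.get(group, ()):
            if action != allowed_action:
                continue
            # Match the prefix itself or anything below it, but not siblings such as
            # '/enterprise/acme2' when the prefix is '/enterprise/acme'.
            if resource == prefix or resource.startswith(prefix.rstrip("/") + "/"):
                return True
    return False


can_read = is_allowed(["acme-users"], "read", "/enterprise/acme/site-1")
can_write = is_allowed(["acme-users"], "write", "/enterprise/acme/site-1")
```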

Securing Machine-to-Machine Communications
------------------------------------------

gNMI naturally lends itself to mutual TLS for authentication, and that is the recommended way to secure
communications between components that speak gNMI.
For example, the communication between aether-config and its adapters uses gNMI and therefore uses mutual TLS.
Distributing certificates between components is a problem outside the scope of the ROC.
It’s assumed that another tool will be responsible for distribution, renewing certificates before they expire, etc.
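
What makes TLS “mutual” is that the server also demands and verifies a certificate from the client.
The sketch below shows that setting using Python’s standard ``ssl`` module. gNMI runs over gRPC, which has its own
credential configuration, so this is an illustration of the concept rather than of the ROC’s code; the function
name and the file-path parameters are hypothetical.

```python
import ssl


def mutual_tls_server_context(cert=None, key=None, ca=None) -> ssl.SSLContext:
    """Build a server-side TLS context that refuses clients without a valid certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED  # require a client certificate: the "mutual" part
    if cert and key:
        ctx.load_cert_chain(certfile=cert, keyfile=key)  # the server's own identity
    if ca:
        ctx.load_verify_locations(cafile=ca)  # CA that signs the client certificates
    return ctx


# Without file paths this only demonstrates the settings; a real deployment would pass
# the certificate, key, and CA files provided by the external distribution tool.
context = mutual_tls_server_context()
```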

For components that speak REST, HTTPS is used to secure the connection, and authentication can take place using
mechanisms within the HTTPS protocol (basic auth, tokens, etc.).
OAuth2 and OpenID Connect are leveraged as an authorization provider when using these REST APIs.

.. [1]
   Adapters are an ad hoc approach to implementing the workflow engine,
   where they map models onto models, including the appropriate semantic
   translation. This is what we originally did in XOS, but we prefer a
   more structured approach for ROC.