0. Introduction

This is the design specification for the next hop tracking feature in
Quagga.

1. Background

Recursive routes are of the form:

      p/m --> n
      [Ex: 1.1.0.0/16 --> 2.2.2.2]

where 'n' itself is resolved through another route as follows:

      p2/m --> h, interface
      [Ex: 2.2.2.0/24 --> 3.3.3.3, eth0]

BGP routes are usually recursive in nature, and BGP nexthops get
resolved through an IGP route. The IGP usually adds its routes
pointing to an interface (these are called non-recursive routes).

When BGP receives a recursive route from a peer, it needs to validate
the nexthop. The path is marked valid or invalid based on the
reachability status of the nexthop. Nexthop validation is also
important for the BGP decision process, as the metric to reach the
nexthop is a parameter to the best path selection process.

As it goes with routing, this is a dynamic process. The route to the
nexthop can change, and the nexthop can become unreachable or
reachable. In the current BGP implementation, nexthop validation is
done periodically in the scanner run. The default scanner run
interval is one minute. Every minute, the scanner task walks the
entire BGP table. It checks the validity of each nexthop with Zebra
(the routing table manager) through a request and response message
exchange between the BGP and Zebra processes. The BGP process is
blocked for that duration. The mechanism has two major drawbacks:

(1) The scanner task runs to completion. That can potentially starve
    the other tasks for long periods of time, depending on the BGP
    table size and the number of nexthops.

(2) Convergence around routing changes that affect the nexthops can be
    long (around a minute with the default intervals). The interval
    can be shortened to achieve faster reaction time, but that makes
    the first problem worse, with the scanner task consuming most of
    the CPU resources.

The "next hop tracking" feature makes this process event-driven. It
eliminates periodic nexthop validation and introduces an asynchronous
communication path between BGP and Zebra for route change
notifications that can then be acted upon.

2. Goal

Stating the obvious, the main goal is to remove the two limitations
discussed in the previous section. Phrased constructively, the goals
are the following:

- fairness: the scanner run should not consume an unjustly high amount
  of CPU time. This should give overall good performance and good
  response time to other events (route changes, session events,
  IO/user interface).

- convergence: BGP must react to nexthop changes instantly and provide
  sub-second convergence. This may involve diverting the routes from
  one nexthop to another.

3. Overview of the changes

The changes are in both the BGP and Zebra modules. The short summary
is the following:

- Zebra implements a registration mechanism by which clients can
  register for next hop notifications. Consequently, it maintains a
  separate table, per (VRF, AF) pair, of next hops and the list of
  interested clients per next hop.

- When the main routing table changes in Zebra, it evaluates the next
  hop table: for each next hop, it checks if the route table
  modifications have changed its state. If so, it notifies the
  interested clients.

- BGP is one such client. It registers the next hops corresponding to
  all of its received routes/paths. It also threads the paths against
  each nexthop structure.

- When BGP receives a next hop notification from Zebra, it walks the
  corresponding path list. It makes the paths valid or invalid
  depending on the next hop notification. It then re-computes the
  best path for the corresponding destination. This may result in
  re-announcing those destinations to peers.

4. Design

4.1. Modules

The core design introduces an "nht" (next hop tracking) module in BGP
and an "rnh" (recursive nexthop) module in Zebra. The "nht" module
provides the following APIs:

bgp_find_or_add_nexthop()  : find or add a nexthop in the BGP nexthop
                             table
bgp_find_nexthop()         : find a nexthop in the BGP nexthop table
bgp_parse_nexthop_update() : parse a nexthop update message coming
                             from zebra

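For illustration, a minimal sketch of how the update receive path
might drive these APIs follows. The prototype of
bgp_find_or_add_nexthop(), its return convention, and the
BGP_INFO_VALID flag handling are assumptions based on the control
flow in section 4.2, not the definitive implementation:

/* Sketch only: plausible call pattern from the update receive path.
   Assumes bgp_find_or_add_nexthop() returns non-zero when the
   nexthop is currently known to be reachable. */
static void
bgp_update_register_nexthop (afi_t afi, struct bgp_info *path)
{
  /* Find (or create) the cache entry for this path's nexthop, thread
     the path onto it, and register the nexthop with zebra if it is
     new. The return value reflects the last known nexthop state. */
  if (bgp_find_or_add_nexthop (afi, path))
    SET_FLAG (path->flags, BGP_INFO_VALID);
  else
    UNSET_FLAG (path->flags, BGP_INFO_VALID);
}
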
The "rnh" module provides the following APIs:

zebra_add_rnh()           : add a recursive nexthop
zebra_delete_rnh()        : delete a recursive nexthop
zebra_lookup_rnh()        : look up a recursive nexthop

zebra_add_rnh_client()    : register a client for nexthop
                            notifications against a recursive nexthop

zebra_remove_rnh_client() : remove the client registration for a
                            recursive nexthop

zebra_evaluate_rnh_table(): (re)evaluate the recursive nexthop table
                            (most probably because the main routing
                            table has changed)

zebra_cleanup_rnh_client(): clean up a client from the "rnh" module
                            data structures (most probably because
                            the client is going away)

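As a sketch of how the zserv and rnh modules might fit together on
registration: the handler name below and the exact rnh prototypes are
assumptions for illustration; only the call sequence follows the
control flow in section 4.2:

/* Sketch only: handling a ZEBRA_NEXTHOP_REGISTER request inside
   zebra. 'client' identifies the registering protocol daemon
   (e.g. bgpd); 'p' is a nexthop prefix decoded from the message. */
static void
zserv_nexthop_register (struct zserv *client, struct prefix *p,
                        vrf_id_t vrfid)
{
  struct rnh *rnh;

  /* Create the rnh entry for this nexthop if it does not already
     exist, then record this client's interest in it. */
  rnh = zebra_add_rnh (p, vrfid);
  zebra_add_rnh_client (rnh, client, vrfid);
}
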
4.2. Control flow

The next hop registration control flow is the following:

     <==== BGP Process ====>|<==== Zebra Process ====>
                            |
receive module    nht module      zserv module       rnh module
----------------------------------------------------------------------
      |               |                |                  |
bgp_update_           |                |                  |
  main()      bgp_find_or_add_         |                  |
                 nexthop()             |                  |
      |               |                |                  |
      |               |         zserv_nexthop_            |
      |               |           register()              |
      |               |                |           zebra_add_rnh()
      |               |                |                  |

The next hop notification control flow is the following:

     <==== Zebra Process ====>|<==== BGP Process ====>
                              |
rib module        rnh module       zebra module      nht module
----------------------------------------------------------------------
     |                |                 |                 |
meta_queue_           |                 |                 |
  process()    zebra_evaluate_          |                 |
                 rnh_table()            |                 |
     |                |                 |                 |
     |                |          bgp_read_nexthop_        |
     |                |              update()             |
     |                |                 |            bgp_parse_
     |                |                 |         nexthop_update()
     |                |                 |                 |

4.3. zclient message format

ZEBRA_NEXTHOP_REGISTER and ZEBRA_NEXTHOP_UNREGISTER messages are
encoded in the following way:

/*
 *  0                   1                   2                   3
 *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |              AF               |  prefix len   |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                        Nexthop prefix                         .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                                                               .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |              AF               |  prefix len   |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                        Nexthop prefix                         .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 */

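For illustration, one (AF, prefix len, prefix) tuple of this message
could be built with libzebra's stream API roughly as follows. This is
a sketch: the field widths are taken from the diagram above, and the
helper name is hypothetical:

/* Sketch only: encoding one nexthop prefix tuple of a
   ZEBRA_NEXTHOP_REGISTER / _UNREGISTER message. */
static void
encode_rnh_prefix (struct stream *s, struct prefix *p)
{
  stream_putw (s, PREFIX_FAMILY (p)); /* AF, 16 bits             */
  stream_putc (s, p->prefixlen);      /* prefix len, 8 bits      */
  stream_put (s, &p->u.prefix,        /* nexthop prefix, rounded */
              PSIZE (p->prefixlen));  /* up to whole bytes       */
}
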
The ZEBRA_NEXTHOP_UPDATE message is encoded as follows:

/*
 *  0                   1                   2                   3
 *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |              AF               |  prefix len   |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .               Nexthop prefix getting resolved                 .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |                            metric                             |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |                          #nexthops                            |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |                         nexthop type                          |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                  resolving Nexthop details                    .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                                                               .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |                         nexthop type                          |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                  resolving Nexthop details                    .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 */

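A decoding sketch for the fixed part of this message follows. Field
widths where the diagram is ambiguous (e.g. #nexthops and nexthop
type read as single bytes) are assumptions, the function name is
hypothetical, and the per-type nexthop decoding is elided:

/* Sketch only: decoding the fixed header of a ZEBRA_NEXTHOP_UPDATE
   message. Decoding of the resolving nexthop details depends on the
   nexthop type and is omitted here. */
static void
decode_nexthop_update (struct stream *s)
{
  struct prefix p;
  u_int32_t metric;
  u_char i, nexthop_num;

  memset (&p, 0, sizeof (p));
  p.family = stream_getw (s);           /* AF */
  p.prefixlen = stream_getc (s);        /* prefix len */
  stream_get (&p.u.prefix, s,           /* nexthop being resolved */
              PSIZE (p.prefixlen));

  metric = stream_getl (s);             /* IGP metric, an input to
                                           best path selection */
  nexthop_num = stream_getc (s);        /* resolving nexthop count */

  for (i = 0; i < nexthop_num; i++)
    {
      u_char type = stream_getc (s);    /* nexthop type */
      /* ... decode resolving nexthop details based on 'type' ... */
      (void) type;
    }
  (void) metric;
}
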
4.4. BGP data structure

Legend:

   /\   struct bgp_node: a BGP destination/route/prefix
   \/

   [ ]  struct bgp_info: a BGP path (e.g. route received from a peer)

    _
   (_)  struct bgp_nexthop_cache: a BGP nexthop


   /\               NULL
   \/--+              ^
       |              :
       +--[ ]--[ ]--[ ]--> NULL
   /\                 :
   \/--+              :
       |              :
       +--[ ]--[ ]--> NULL
                      :
    _                 :
   (_).................

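The picture suggests a cache entry shaped roughly as follows. This is
a sketch only; the field names are illustrative, and the real struct
bgp_nexthop_cache in bgpd carries additional bookkeeping fields:

/* Sketch only: a plausible shape for the nexthop cache entry shown
   above. */
struct bgp_nexthop_cache
{
  u_char flags;            /* e.g. valid / registered with zebra */
  u_int32_t metric;        /* IGP metric to reach the nexthop */
  u_int32_t path_count;    /* number of paths using this nexthop */
  struct bgp_info *paths;  /* paths threaded against this nexthop
                              (the dotted line in the picture) */
  time_t last_update;      /* when the last notification arrived */
};
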
4.5. Zebra data structure

rnh table:

          O
         / \
        O   O
           / \
          O   O

struct rnh
{
  u_char flags;              /* entry state flags */
  struct rib *state;         /* route currently resolving this
                                nexthop (NULL if unresolved) */
  struct list *client_list;  /* clients registered for this nexthop */
  struct route_node *node;   /* back-pointer to the table node */
};

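Conceptually, the evaluation entry point walks this table whenever
the RIB changes and re-resolves each registered nexthop. A sketch
follows, in which rnh_table(), rnh_resolve_route() and
rnh_notify_clients() are hypothetical helper names:

/* Sketch only: conceptual body of zebra_evaluate_rnh_table(). */
void
zebra_evaluate_rnh_table (vrf_id_t vrfid, int family)
{
  struct route_node *rn;
  struct rnh *rnh;
  struct rib *rib;

  for (rn = route_top (rnh_table (vrfid, family)); rn;
       rn = route_next (rn))
    {
      if ((rnh = rn->info) == NULL)
        continue;

      /* Re-resolve the nexthop against the current RIB. */
      rib = rnh_resolve_route (vrfid, &rn->p);

      /* Notify interested clients only when the resolving route has
         actually changed since the last evaluation. */
      if (rib != rnh->state)
        {
          rnh->state = rib;
          rnh_notify_clients (rnh);
        }
    }
}
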
5. User interface changes

quagga# show ip nht
3.3.3.3
 resolved via kernel
 via 11.0.0.6, swp1
 Client list: bgp(fd 12)
11.0.0.10
 resolved via connected
 is directly connected, swp2
 Client list: bgp(fd 12)
11.0.0.18
 resolved via connected
 is directly connected, swp4
 Client list: bgp(fd 12)
11.11.11.11
 resolved via kernel
 via 10.0.1.2, eth0
 Client list: bgp(fd 12)

quagga# show ip bgp nexthop
Current BGP nexthop cache:
 3.3.3.3 valid [IGP metric 0], #paths 3
  Last update: Wed Oct 16 04:43:49 2013

 11.0.0.10 valid [IGP metric 1], #paths 1
  Last update: Wed Oct 16 04:43:51 2013

 11.0.0.18 valid [IGP metric 1], #paths 2
  Last update: Wed Oct 16 04:43:47 2013

 11.11.11.11 valid [IGP metric 0], #paths 1
  Last update: Wed Oct 16 04:43:47 2013

quagga# show ipv6 nht
quagga# show ip bgp nexthop detail

quagga# debug bgp nht
quagga# debug zebra nht

6. Sample test cases

     r2----r3
    /  \  /
  r1----r4

- Verify that a change in IGP cost triggers NHT
  + shut down the r1-r4 and r2-r4 links
  + bring the r1-r4 and r2-r4 links back up ('no shut') and wait for
    OSPF to come back up
  + we should be back to the original nexthop via r4 now
- Verify that a NH becoming unreachable triggers NHT
  + shut down all links to r4
- Verify that a NH becoming reachable triggers NHT
  + bring all links to r4 back up ('no shut')

7. Future work

- route-policy for next hop validation (e.g. ignore default route)
- damping for rapid next hop changes
- prioritized handling of nexthop changes ((un)reachability vs. metric
  changes)
- handling recursion loops, e.g.
    11.11.11.11/32 -> 12.12.12.12
    12.12.12.12/32 -> 11.11.11.11
    11.0.0.0/8 -> <interface>
- better statistics