bgpd, doc, lib, zebra: nexthop-tracking in zebra

0. Introduction

This is the design specification for next hop tracking feature in
Quagga.

1. Background

Recursive routes are of the form:

   p/m --> n
  [Ex: 1.1.0.0/16 --> 2.2.2.2]

where 'n' itself is resolved through another route as follows:

   p2/m --> h, interface
  [Ex: 2.2.2.0/24 --> 3.3.3.3, eth0]

Usually, BGP routes are recursive in nature and BGP nexthops get
resolved through an IGP route. IGP usually adds its routes pointing to
an interface (these are called non-recursive routes).

When BGP receives a recursive route from a peer, it needs to validate
the nexthop. The path is marked valid or invalid based on the
reachability status of the nexthop.  Nexthop validation is also
important for BGP decision process as the metric to reach the nexthop
is a parameter to best path selection process.

As it goes with routing, this is a dynamic process. Route to the
nexthop can change. The nexthop can become unreachable or
reachable. In the current BGP implementation, the nexthop validation
is done periodically in the scanner run. The default scanner run
interval is one minute. Every minute, the scanner task walks the
entire BGP table. It checks the validity of each nexthop with Zebra
(the routing table manager) through a request and response message
exchange between BGP and Zebra process. BGP process is blocked for
that duration. The mechanism has two major drawbacks:

(1) The scanner task runs to completion. That can potentially starve
    the other tasks for long periods of time, based on the BGP table
    size and number of nexthops.

(2) Convergence around routing changes that affect the nexthops can be
    long (around a minute with the default intervals). The interval
    can be shortened to achieve faster reaction time, but it makes the
    first problem worse, with the scanner task consuming most of the
    CPU resources.

"Next hop tracking" feature makes this process event-driven. It
eliminates periodic nexthop validation and introduces an asynchronous
communication path between BGP and Zebra for route change notifications
that can then be acted upon.

2. Goal

Stating the obvious, the main goal is to remove the two limitations we
discussed in the previous section. The goals, in a constructive tone,
are the following:

- fairness: the scanner run should not consume an unjustly high amount
  of CPU time. This should give an overall good performance and
  response time to other events (route changes, session events,
  IO/user interface).

- convergence: BGP must react to nexthop changes instantly and provide
  sub-second convergence. This may involve diverting the routes from
  one nexthop to another.

3. Overview of the changes

The changes are in both BGP and Zebra modules.  The short summary is
the following:

- Zebra implements a registration mechanism by which clients can
   register for next hop notification. Consequently, it maintains a
   separate table, per (VRF, AF) pair, of next hops and interested
   client-list per next hop.

- When the main routing table changes in Zebra, it evaluates the next
   hop table: for each next hop, it checks if the route table
   modifications have changed its state. If so, it notifies the
   interested clients.

- BGP is one such client. It registers the next hops corresponding to
   all of its received routes/paths. It also threads the paths against
   each nexthop structure.

- When BGP receives a next hop notification from Zebra, it walks the
   corresponding path list. It makes them valid or invalid depending
   on the next hop notification. It then re-computes best path for the
   corresponding destination. This may result in re-announcing those
   destinations to peers.

4. Design

4.1. Modules

The core design introduces an "nht" (next hop tracking) module in BGP
and "rnh" (recursive nexthop) module in Zebra. The "nht" module
provides the following APIs:

bgp_find_or_add_nexthop() : find or add a nexthop in BGP nexthop table
bgp_find_nexthop() : find a nexthop in BGP nexthop table
bgp_parse_nexthop_update() : parse a nexthop update message coming
                              from zebra

The "rnh" module provides the following APIs:

zebra_add_rnh() : add a recursive nexthop
zebra_delete_rnh() : delete a recursive nexthop
zebra_lookup_rnh() : lookup a recursive nexthop

zebra_add_rnh_client() : register a client for nexthop notifications
                         against a recursive nexthop

zebra_remove_rnh_client(): remove the client registration for a
                            recursive nexthop

zebra_evaluate_rnh_table(): (re)evaluate the recursive nexthop table
                            (most probably because the main routing
                            table has changed).

zebra_cleanup_rnh_client(): Cleanup a client from the "rnh" module
                            data structures (most probably because the
                            client is going away).

4.2. Control flow

The next hop registration control flow is the following:

<====      BGP Process       ====>|<====      Zebra Process      ====>
                                  |
receive module     nht module     |  zserv module        rnh module
----------------------------------------------------------------------
              |                   |                  |
bgp_update_   |                   |                  |
      main()  | bgp_find_or_add_  |                  |
              |        nexthop()  |                  |
              |                   |                  |
              |                   | zserv_nexthop_   |
              |                   |       register() |
              |                   |                  | zebra_add_rnh()
              |                   |                  |

The next hop notification control flow is the following:

<====     Zebra Process    ====>|<====      BGP Process       ====>
                                |
rib module         rnh module   |     zebra module        nht module
----------------------------------------------------------------------
              |                 |                   |
meta_queue_   |                 |                   |
    process() | zebra_evaluate_ |                   |
              |     rnh_table() |                   |
              |                 |                   |
              |                 | bgp_read_nexthop_ |
              |                 |          update() |
              |                 |                   | bgp_parse_
              |                 |                   | nexthop_update()
              |                 |                   |

4.3. zclient message format

ZEBRA_NEXTHOP_REGISTER and ZEBRA_NEXTHOP_UNREGISTER messages are
encoded in the following way:

/*
 *     0                   1                   2                   3
 *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |     AF                        |  prefix len   |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .      Nexthop prefix                                           .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                                                               .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |     AF                        |  prefix len   |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .      Nexthop prefix                                           .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 */

ZEBRA_NEXTHOP_UPDATE message is encoded as follows:

/*
 *     0                   1                   2                   3
 *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |     AF                        |  prefix len   |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .      Nexthop prefix getting resolved                          .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |        metric                                                 |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |  #nexthops    |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * | nexthop type  |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .      resolving Nexthop details                                .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * | nexthop type  |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .      resolving Nexthop details                                .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 */

4.4. BGP data structure

Legend:

/\   struct bgp_node: a BGP destination/route/prefix
\/

[ ]  struct bgp_info: a BGP path (e.g. route received from a peer)

 _
(_)  struct bgp_nexthop_cache: a BGP nexthop

   /\         NULL
   \/--+        ^
       |        :
       +--[ ]--[ ]--[ ]--> NULL
   /\           :
   \/--+        :
       |        :
       +--[ ]--[ ]--> NULL
                :
  _             :
 (_).............

4.5. Zebra data structure

rnh table:

           O
          / \
         O   O
            / \
           O   O

        struct rnh
        {
          u_char flags;
          struct rib *state;
          struct list *client_list;
          struct route_node *node;
        };

5. User interface changes

quagga# show ip nht
3.3.3.3
 resolved via kernel
 via 11.0.0.6, swp1
 Client list: bgp(fd 12)
11.0.0.10
 resolved via connected
 is directly connected, swp2
 Client list: bgp(fd 12)
11.0.0.18
 resolved via connected
 is directly connected, swp4
 Client list: bgp(fd 12)
11.11.11.11
 resolved via kernel
 via 10.0.1.2, eth0
 Client list: bgp(fd 12)

quagga# show ip bgp nexthop
Current BGP nexthop cache:
 3.3.3.3 valid [IGP metric 0], #paths 3
  Last update: Wed Oct 16 04:43:49 2013

 11.0.0.10 valid [IGP metric 1], #paths 1
  Last update: Wed Oct 16 04:43:51 2013

 11.0.0.18 valid [IGP metric 1], #paths 2
  Last update: Wed Oct 16 04:43:47 2013

 11.11.11.11 valid [IGP metric 0], #paths 1
  Last update: Wed Oct 16 04:43:47 2013

quagga# show ipv6 nht
quagga# show ip bgp nexthop detail

quagga# debug bgp nht
quagga# debug zebra nht

6. Sample test cases

     r2----r3
    /  \  /
  r1----r4

- Verify that a change in IGP cost triggers NHT
  + shutdown the r1-r4 and r2-r4 links
  + no shut the r1-r4 and r2-r4 links and wait for OSPF to come back
    up
  + We should be back to the original nexthop via r4 now
- Verify that a NH becoming unreachable triggers NHT
  + Shutdown all links to r4
- Verify that a NH becoming reachable triggers NHT
  + no shut all links to r4

7. Future work

- route-policy for next hop validation (e.g. ignore default route)
- damping for rapid next hop changes
- prioritized handling of nexthop changes ((un)reachability vs. metric
  changes)
- handling recursion loop, e.g.
   11.11.11.11/32 -> 12.12.12.12
   12.12.12.12/32 -> 11.11.11.11
   11.0.0.0/8 -> <interface>
- better statistics
Addresses upstream comments.

"show ip bgp nexthop detail" couldn't display multiple NHs due to a bug.
Fix that.

Fix reference counts for the nexthop cache entries

Signed-off-by: Pradosh Mohapatra <pmohapat@cumulusnetworks.com>
Signed-off-by: Daniel Walton <dwalton@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: Vivek Venkatraman <vivek@cumulusnetworks.com>

Fix reference counts for the nexthop cache entries.

Signed-off-by: Vivek Venkatraman <vivek@cumulusnetworks.com>

Edited-by: Paul Jakma <paul.jakma@hpe.com>
- Fix nexthop_ipv6_add defs in rib.h not having been modified with rib_ prefix.
- Remove rib_lookup_and_pushup, appears not to be used except for
  !HAVE_NETLINK && HAVE_STRUCT_IFALIASREQ case of ioctl.c::if_set_prefix,
  so it's not being used at all on platform with most testing of RIB.
diff --git a/zebra/zebra_rnh.c b/zebra/zebra_rnh.c
new file mode 100644
index 0000000..3e02812
--- /dev/null
+++ b/zebra/zebra_rnh.c
@@ -0,0 +1,603 @@
+/* Zebra next hop tracking code
+ * Copyright (C) 2013 Cumulus Networks, Inc.
+ *
+ * This file is part of GNU Zebra.
+ *
+ * GNU Zebra is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2, or (at your option) any
+ * later version.
+ *
+ * GNU Zebra is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU Zebra; see the file COPYING.  If not, write to the Free
+ * Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+ * 02111-1307, USA.
+ */
+
+#include <zebra.h>
+
+#include "prefix.h"
+#include "table.h"
+#include "memory.h"
+#include "str.h"
+#include "command.h"
+#include "if.h"
+#include "log.h"
+#include "sockunion.h"
+#include "linklist.h"
+#include "thread.h"
+#include "workqueue.h"
+#include "prefix.h"
+#include "routemap.h"
+#include "stream.h"
+#include "nexthop.h"
+
+#include "zebra/rib.h"
+#include "zebra/rt.h"
+#include "zebra/zserv.h"
+#include "zebra/redistribute.h"
+#include "zebra/debug.h"
+#include "zebra/zebra_rnh.h"
+
+#define lookup_rnh_table(v, f)		         \
+({						 \
+  struct zebra_vrf *zvrf;                        \
+  struct route_table *t = NULL;                  \
+  zvrf = zebra_vrf_lookup(v);                    \
+  if (zvrf)                                      \
+    t = zvrf->rnh_table[family2afi(f)];	         \
+  t;                                             \
+})
+
+static void free_state(struct rib *rib);
+static void copy_state(struct rnh *rnh, struct rib *rib);
+static int compare_state(struct rib *r1, struct rib *r2);
+static int send_client(struct rnh *rnh, struct zserv *client, vrf_id_t vrf_id);
+static void print_rnh(struct route_node *rn, struct vty *vty);
+
+char *
+rnh_str (struct rnh *rnh, char *buf, int size)
+{
+  prefix2str(&(rnh->node->p), buf, size);
+  return buf;
+}
+
+struct rnh *
+zebra_add_rnh (struct prefix *p, vrf_id_t vrfid)
+{
+  struct route_table *table;
+  struct route_node *rn;
+  struct rnh *rnh = NULL;
+
+  if (IS_ZEBRA_DEBUG_NHT)
+    {
+      char buf[INET6_ADDRSTRLEN];
+      prefix2str(p, buf, INET6_ADDRSTRLEN);
+      zlog_debug("add rnh %s in vrf %d", buf, vrfid);
+    }
+  table = lookup_rnh_table(vrfid, PREFIX_FAMILY(p));
+  if (!table)
+    {
+      zlog_debug("add_rnh: rnh table not found\n");
+      return NULL;
+    }
+
+  /* Make it sure prefixlen is applied to the prefix. */
+  apply_mask (p);
+
+  /* Lookup (or add) route node.*/
+  rn = route_node_get (table, p);
+
+  if (!rn->info)
+    {
+      rnh = XCALLOC(MTYPE_RNH, sizeof(struct rnh));
+      rnh->client_list = list_new();
+      route_lock_node (rn);
+      rn->info = rnh;
+      rnh->node = rn;
+    }
+
+  route_unlock_node (rn);
+  return (rn->info);
+}
+
+struct rnh *
+zebra_lookup_rnh (struct prefix *p, vrf_id_t vrfid)
+{
+  struct route_table *table;
+  struct route_node *rn;
+
+  table = lookup_rnh_table(vrfid, PREFIX_FAMILY(p));
+  if (!table)
+    return NULL;
+
+  /* Make it sure prefixlen is applied to the prefix. */
+  apply_mask (p);
+
+  /* Lookup route node.*/
+  rn = route_node_lookup (table, p);
+  if (!rn)
+    return NULL;
+
+  route_unlock_node (rn);
+  return (rn->info);
+}
+
+void
+zebra_delete_rnh (struct rnh *rnh)
+{
+  struct route_node *rn;
+
+  if (!rnh || !(rn = rnh->node))
+    return;
+
+  if (IS_ZEBRA_DEBUG_NHT)
+    {
+      char buf[INET6_ADDRSTRLEN];
+      zlog_debug("delete rnh %s", rnh_str(rnh, buf, INET6_ADDRSTRLEN));
+    }
+
+  list_free(rnh->client_list);
+  free_state(rnh->state);
+  XFREE(MTYPE_RNH, rn->info);
+  rn->info = NULL;
+  route_unlock_node (rn);
+  return;
+}
+
+void
+zebra_add_rnh_client (struct rnh *rnh, struct zserv *client, vrf_id_t vrf_id)
+{
+  if (IS_ZEBRA_DEBUG_NHT)
+    {
+      char buf[INET6_ADDRSTRLEN];
+      zlog_debug("client %s registers rnh %s",
+		 zebra_route_string(client->proto),
+		 rnh_str(rnh, buf, INET6_ADDRSTRLEN));
+    }
+  if (!listnode_lookup(rnh->client_list, client))
+    {
+      listnode_add(rnh->client_list, client);
+      send_client(rnh, client, vrf_id);
+    }
+}
+
+void
+zebra_remove_rnh_client (struct rnh *rnh, struct zserv *client)
+{
+  if (IS_ZEBRA_DEBUG_NHT)
+    {
+      char buf[INET6_ADDRSTRLEN];
+      zlog_debug("client %s unregisters rnh %s",
+		 zebra_route_string(client->proto),
+		 rnh_str(rnh, buf, INET6_ADDRSTRLEN));
+    }
+  listnode_delete(rnh->client_list, client);
+  if (list_isempty(rnh->client_list))
+    zebra_delete_rnh(rnh);
+}
+
+int
+zebra_evaluate_rnh_table (vrf_id_t vrfid, int family)
+{
+  struct route_table *ptable;
+  struct route_table *ntable;
+  struct route_node *prn;
+  struct route_node *nrn;
+  struct rnh *rnh;
+  struct zserv *client;
+  struct listnode *node;
+  struct rib *rib;
+
+  ntable = lookup_rnh_table(vrfid, family);
+  if (!ntable)
+    {
+      zlog_debug("evaluate_rnh_table: rnh table not found\n");
+      return -1;
+    }
+
+  ptable = zebra_vrf_table(family2afi(family), SAFI_UNICAST, vrfid);
+  if (!ptable)
+    {
+      zlog_debug("evaluate_rnh_table: prefix table not found\n");
+      return -1;
+    }
+
+  for (nrn = route_top (ntable); nrn; nrn = route_next (nrn))
+    {
+      if (!nrn->info)
+	  continue;
+
+      prn = route_node_match(ptable, &nrn->p);
+      if (!prn)
+	rib = NULL;
+      else
+	{
+	  RNODE_FOREACH_RIB(prn, rib)
+	    {
+	      if (CHECK_FLAG (rib->status, RIB_ENTRY_REMOVED))
+		continue;
+	      if (CHECK_FLAG (rib->flags, ZEBRA_FLAG_SELECTED))
+		break;
+	    }
+	}
+
+      rnh = nrn->info;
+      if (compare_state(rib, rnh->state))
+	{
+	  if (IS_ZEBRA_DEBUG_NHT)
+	    {
+	      char bufn[INET6_ADDRSTRLEN];
+	      char bufp[INET6_ADDRSTRLEN];
+	      prefix2str(&nrn->p, bufn, INET6_ADDRSTRLEN);
+	      if (prn)
+		prefix2str(&prn->p, bufp, INET6_ADDRSTRLEN);
+	      else
+		strcpy(bufp, "null");
+	      zlog_debug("rnh %s resolved through route %s - sending "
+			 "nexthop %s event to clients", bufn, bufp,
+			 rib ? "reachable" : "unreachable");
+	    }
+	  copy_state(rnh, rib);
+	  for (ALL_LIST_ELEMENTS_RO(rnh->client_list, node, client))
+	    send_client(rnh, client, vrfid);
+	}
+    }
+  return 1;
+}
+
+int
+zebra_dispatch_rnh_table (vrf_id_t vrfid, int family, struct zserv *client)
+{
+  struct route_table *ntable;
+  struct route_node *nrn;
+  struct rnh *rnh;
+
+  ntable = lookup_rnh_table(vrfid, family);
+  if (!ntable)
+    {
+      zlog_debug("dispatch_rnh_table: rnh table not found\n");
+      return -1;
+    }
+
+  for (nrn = route_top (ntable); nrn; nrn = route_next (nrn))
+    {
+      if (!nrn->info)
+	  continue;
+
+      rnh = nrn->info;
+      if (IS_ZEBRA_DEBUG_NHT)
+	{
+	  char bufn[INET6_ADDRSTRLEN];
+	  prefix2str(&nrn->p, bufn, INET6_ADDRSTRLEN);
+	  zlog_debug("rnh %s - sending nexthop %s event to client %s", bufn,
+		     rnh->state ? "reachable" : "unreachable",
+		     zebra_route_string(client->proto));
+	}
+      send_client(rnh, client, vrfid);
+    }
+  return 1;
+}
+
+void
+zebra_print_rnh_table (vrf_id_t vrfid, int af, struct vty *vty)
+{
+  struct route_table *table;
+  struct route_node *rn;
+
+  table = lookup_rnh_table(vrfid, af);
+  if (!table)
+    {
+      zlog_debug("print_rnhs: rnh table not found\n");
+      return;
+    }
+
+  for (rn = route_top(table); rn; rn = route_next(rn))
+      if (rn->info)
+	print_rnh(rn, vty);
+}
+
+int
+zebra_cleanup_rnh_client (vrf_id_t vrfid, int family, struct zserv *client)
+{
+  struct route_table *ntable;
+  struct route_node *nrn;
+  struct rnh *rnh;
+
+  ntable = lookup_rnh_table(vrfid, family);
+  if (!ntable)
+    {
+      zlog_debug("cleanup_rnh_client: rnh table not found\n");
+      return -1;
+    }
+
+  for (nrn = route_top (ntable); nrn; nrn = route_next (nrn))
+    {
+      if (!nrn->info)
+	  continue;
+
+      rnh = nrn->info;
+      if (IS_ZEBRA_DEBUG_NHT)
+	{
+	  char bufn[INET6_ADDRSTRLEN];
+	  prefix2str(&nrn->p, bufn, INET6_ADDRSTRLEN);
+	  zlog_debug("rnh %s - cleaning state for client %s", bufn,
+		     zebra_route_string(client->proto));
+	}
+      zebra_remove_rnh_client(rnh, client);
+    }
+  return 1;
+}
+
+/**
+ * free_state - free up the rib structure associated with the rnh.
+ */
+static void
+free_state (struct rib *rib)
+{
+  struct nexthop *nexthop, *next;
+
+  if (!rib)
+    return;
+
+  /* free RIB and nexthops */
+  for (nexthop = rib->nexthop; nexthop; nexthop = next)
+    {
+      next = nexthop->next;
+      nexthop_free (nexthop);
+    }
+  XFREE (MTYPE_RIB, rib);
+}
+
+/**
+ * copy_nexthop - copy a nexthop to the rib structure.
+ */
+static void
+rib_copy_nexthop (struct rib *state, struct nexthop *nh)
+{
+  struct nexthop *nexthop;
+
+  nexthop = nexthop_new();
+  nexthop->flags = nh->flags;
+  nexthop->type = nh->type;
+  nexthop->ifindex = nh->ifindex;
+  if (nh->ifname)
+    nexthop->ifname = XSTRDUP(0, nh->ifname);
+  memcpy(&(nexthop->gate), &(nh->gate), sizeof(union g_addr));
+  memcpy(&(nexthop->src), &(nh->src), sizeof(union g_addr));
+
+  rib_nexthop_add(state, nexthop);
+}
+
+static void
+copy_state (struct rnh *rnh, struct rib *rib)
+{
+  struct rib *state;
+  struct nexthop *nh;
+
+  if (rnh->state)
+    {
+      free_state(rnh->state);
+      rnh->state = NULL;
+    }
+
+  if (!rib)
+    return;
+
+  state = XCALLOC (MTYPE_RIB, sizeof (struct rib));
+  state->type = rib->type;
+  state->metric = rib->metric;
+
+  for (nh = rib->nexthop; nh; nh = nh->next)
+    rib_copy_nexthop(state, nh);
+  rnh->state = state;
+}
+
+static int
+compare_state (struct rib *r1, struct rib *r2)
+{
+  struct nexthop *nh1;
+  struct nexthop *nh2;
+  u_char found_nh = 0;
+
+  if (!r1 && !r2)
+    return 0;
+
+  if ((!r1 && r2) || (r1 && !r2))
+      return 1;
+
+  if (r1->metric != r2->metric)
+      return 1;
+
+  if (r1->nexthop_num != r2->nexthop_num)
+      return 1;
+
+  /* We need to verify that the nexthops for r1 match the nexthops for r2.
+   * Since it is possible for a rib entry to have the same nexthop multiple
+   * times (Example: [a,a]) we need to keep track of which r2 nexthops we have
+   * already used as a match against a r1 nexthop.  We track this
+   * via NEXTHOP_FLAG_MATCHED. Clear this flag for all r2 nexthops when you
+   * are finished.
+   *
+   * TRUE:  r1 [a,b], r2 [a,b]
+   * TRUE:  r1 [a,b], r2 [b,a]
+   * FALSE: r1 [a,b], r2 [a,c]
+   * FALSE: r1 [a,a], r2 [a,b]
+   */
+  for (nh1 = r1->nexthop; nh1; nh1 = nh1->next)
+    {
+      found_nh = 0;
+      for (nh2 = r2->nexthop; nh2; nh2 = nh2->next)
+        {
+          if (CHECK_FLAG (nh2->flags, NEXTHOP_FLAG_MATCHED))
+            continue;
+
+          if (nexthop_same_no_recurse(nh1, nh2))
+            {
+              SET_FLAG (nh2->flags, NEXTHOP_FLAG_MATCHED);
+              found_nh = 1;
+              break;
+            }
+        }
+
+      if (!found_nh)
+        {
+          for (nh2 = r2->nexthop; nh2; nh2 = nh2->next)
+            if (CHECK_FLAG (nh2->flags, NEXTHOP_FLAG_MATCHED))
+              UNSET_FLAG (nh2->flags, NEXTHOP_FLAG_MATCHED);
+          return 1;
+        }
+    }
+
+  for (nh2 = r2->nexthop; nh2; nh2 = nh2->next)
+    if (CHECK_FLAG (nh2->flags, NEXTHOP_FLAG_MATCHED))
+      UNSET_FLAG (nh2->flags, NEXTHOP_FLAG_MATCHED);
+
+  return 0;
+}
+
+static int
+send_client (struct rnh *rnh, struct zserv *client, vrf_id_t vrf_id)
+{
+  struct stream *s;
+  struct rib *rib;
+  unsigned long nump;
+  u_char num;
+  struct nexthop *nexthop;
+  struct route_node *rn;
+
+  rn = rnh->node;
+  rib = rnh->state;
+
+  /* Get output stream. */
+  s = client->obuf;
+  stream_reset (s);
+
+  zserv_create_header (s, ZEBRA_NEXTHOP_UPDATE, vrf_id);
+
+  stream_putw(s, rn->p.family);
+  stream_put_prefix (s, &rn->p);
+
+  if (rib)
+    {
+      stream_putl (s, rib->metric);
+      num = 0;
+      nump = stream_get_endp(s);
+      stream_putc (s, 0);
+      for (nexthop = rib->nexthop; nexthop; nexthop = nexthop->next)
+	if (CHECK_FLAG (nexthop->flags, NEXTHOP_FLAG_FIB))
+	  {
+	    stream_putc (s, nexthop->type);
+	    switch (nexthop->type)
+	      {
+	      case ZEBRA_NEXTHOP_IPV4:
+		stream_put_in_addr (s, &nexthop->gate.ipv4);
+		break;
+	      case ZEBRA_NEXTHOP_IFINDEX:
+	      case ZEBRA_NEXTHOP_IFNAME:
+		stream_putl (s, nexthop->ifindex);
+		break;
+	      case ZEBRA_NEXTHOP_IPV4_IFINDEX:
+	      case ZEBRA_NEXTHOP_IPV4_IFNAME:
+		stream_put_in_addr (s, &nexthop->gate.ipv4);
+		stream_putl (s, nexthop->ifindex);
+		break;
+#ifdef HAVE_IPV6
+	      case ZEBRA_NEXTHOP_IPV6:
+		stream_put (s, &nexthop->gate.ipv6, 16);
+		break;
+	      case ZEBRA_NEXTHOP_IPV6_IFINDEX:
+	      case ZEBRA_NEXTHOP_IPV6_IFNAME:
+		stream_put (s, &nexthop->gate.ipv6, 16);
+		stream_putl (s, nexthop->ifindex);
+		break;
+#endif /* HAVE_IPV6 */
+	      default:
+                /* do nothing */
+		break;
+	      }
+	    num++;
+	  }
+      stream_putc_at (s, nump, num);
+    }
+  else
+    {
+      stream_putl (s, 0);
+      stream_putc (s, 0);
+    }
+  stream_putw_at (s, 0, stream_get_endp (s));
+  return zebra_server_send_message(client);
+}
+
+static void
+print_nh (struct nexthop *nexthop, struct vty *vty)
+{
+  char buf[BUFSIZ];
+
+  switch (nexthop->type)
+    {
+    case NEXTHOP_TYPE_IPV4:
+    case NEXTHOP_TYPE_IPV4_IFINDEX:
+      vty_out (vty, " via %s", inet_ntoa (nexthop->gate.ipv4));
+      if (nexthop->ifindex)
+	vty_out (vty, ", %s", ifindex2ifname (nexthop->ifindex));
+      break;
+    case NEXTHOP_TYPE_IPV6:
+    case NEXTHOP_TYPE_IPV6_IFINDEX:
+    case NEXTHOP_TYPE_IPV6_IFNAME:
+      vty_out (vty, " %s",
+	       inet_ntop (AF_INET6, &nexthop->gate.ipv6, buf, BUFSIZ));
+      if (nexthop->type == NEXTHOP_TYPE_IPV6_IFNAME)
+	vty_out (vty, ", %s", nexthop->ifname);
+      else if (nexthop->ifindex)
+	vty_out (vty, ", via %s", ifindex2ifname (nexthop->ifindex));
+      break;
+    case NEXTHOP_TYPE_IFINDEX:
+      vty_out (vty, " is directly connected, %s",
+	       ifindex2ifname (nexthop->ifindex));
+      break;
+    case NEXTHOP_TYPE_IFNAME:
+      vty_out (vty, " is directly connected, %s", nexthop->ifname);
+      break;
+    case NEXTHOP_TYPE_BLACKHOLE:
+      vty_out (vty, " is directly connected, Null0");
+      break;
+    default:
+      break;
+    }
+  vty_out(vty, "%s", VTY_NEWLINE);
+}
+
+static void
+print_rnh (struct route_node *rn, struct vty *vty)
+{
+  struct rnh *rnh;
+  struct nexthop *nexthop;
+  struct listnode *node;
+  struct zserv *client;
+  char buf[BUFSIZ];
+
+  rnh = rn->info;
+  vty_out(vty, "%s%s", inet_ntop(rn->p.family, &rn->p.u.prefix, buf, BUFSIZ),
+	  VTY_NEWLINE);
+  if (rnh->state)
+    {
+      vty_out(vty, " resolved via %s%s",
+	      zebra_route_string(rnh->state->type), VTY_NEWLINE);
+      for (nexthop = rnh->state->nexthop; nexthop; nexthop = nexthop->next)
+	print_nh(nexthop, vty);
+    }
+  else
+    vty_out(vty, " unresolved%s", VTY_NEWLINE);
+
+  vty_out(vty, " Client list:");
+  for (ALL_LIST_ELEMENTS_RO(rnh->client_list, node, client))
+    vty_out(vty, " %s(fd %d)", zebra_route_string(client->proto),
+	    client->sock);
+  vty_out(vty, "%s", VTY_NEWLINE);
+}