Don’t Touch My EtherChannels! – CML 2.3 e1000 MAC Address Pool Bug

Cisco’s premier labbing solution, Cisco Modeling Labs (CML), just had a new release! Version 2.3 features a much improved UI, properly working subinterface support for the IOS-XE devices (hallelujah!), and a new reference platform (refplat) image that has updated images and includes the shiny new virtual Catalyst 8000 (Cat8000v) in its IOS-XE 17.6.1 release. This is alongside some other changes! This is a big frickin’ deal!

(But(t)s are nasty…and this is a particularly nasty one.)

But, it wouldn’t be a software release without some bugs, and this is a whopper of a bug! (Don’t sue me, Burger King.)

You see, Cisco decided to move CML’s underlying operating system from CentOS to Ubuntu (based on Debian, for the Linux-uninitiated) to account for the deprecation of CentOS. While this sounds like a cool change on paper, it required a lot of work on Cisco’s part to get CML running on Ubuntu. Presumably, this major bug in how MAC addresses are randomized slipped through the crack in the door of the e1000 driver stack implementation without anyone checking before the door was closed.

I did a post on the Cisco Learning Network community for CML about this issue. You can find my post as a comment on this thread here. You can also read the copy of the post that I included below. Happy reading!

UPDATE – I should also note that the CML development team seems to have picked up on this bug and has seen the write-up that I have included below. Based on a reply from what I assume to be someone fairly involved in CML’s development, it seems fairly certain that this will be resolved in an upcoming release of CML. You can see this reply with the existing “this thread here” link above.


Cisco Learning Network Community Post

Ah, darn! I was going to make a post about this, but it seems someone has already beaten me to it. In that case, I’ll expand on this with some details of my assessment of the root cause of this after some debugging and comparisons of CML 2.3 with CML 2.2.

First, some important things to remember:

  1. The IOSvL2 image uses the e1000 driver; this is the only driver type that is susceptible to this issue (N9K with the vmxnet3 driver is not). This is a strong indicator.
  2. LACP/PAgP builds its system ID (device ID, in PAgP terms) using the first two hextets of the base MAC address.
  3. The MAC address of an SVI is determined by two factors: the first two hextets (see a theme?) of the base MAC address and the VLAN ID encoded in the last 3 hex characters of the MAC address.

The root cause of this issue is that CML 2.3 is using a single MAC address pool and then ensuring that the last (third) hextet is unique across nodes and interfaces; this ensures that all interfaces have unique MAC addresses. However, the first two hextets are always 5254.0000. That is problematic, because there are features that take only the first two hextets and assume they are unique per device. This is where EtherChannel and SVI failures come into play.

While CML 2.2 randomized the second hextet, CML 2.3 randomizes the third. This means that when LACP is sending out its LACPDUs, it ends up using the same system ID on both ends, meaning that, from the switch’s perspective, the LACPDU seems to suggest that it is peering with itself. It seems that it assumes, at this point, that the LACPDU has looped around and drops it, preventing the successful formation of a dynamic EtherChannel. I’ve attached a picture of a sample packet capture of an LACPDU to demonstrate this.

[Screenshot: sample packet capture of an LACPDU]

While I’m focusing specifically on LACP in my explanation, PAgP is also affected, as it uses the same mechanism to select its device ID that it, similarly, advertises in its messages to the peer.
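To make the collision concrete, here is a minimal Python sketch of the behavior as I have described it above. It assumes that the system/device ID is keyed on the first two hextets of the base MAC address and that the bridge address shown by show spanning-tree is that base MAC; the function names are my own and are not part of CML or IOS. The sample addresses are the bridge addresses from the command output later in this post.

```python
def hextets(mac: str) -> list[str]:
    """Split a Cisco dotted MAC such as '5254.0000.0f79' into three 16-bit groups."""
    parts = mac.lower().split(".")
    if len(parts) != 3 or any(len(p) != 4 for p in parts):
        raise ValueError(f"not a dotted Cisco MAC: {mac!r}")
    int("".join(parts), 16)  # raises ValueError if any character is not hex
    return parts


def id_prefix(base_mac: str) -> str:
    """Assumption from this post: only the first two hextets feed the ID."""
    return ".".join(hextets(base_mac)[:2])


def peers_collide(mac_a: str, mac_b: str) -> bool:
    """True when two switches would advertise the same ID prefix to each other."""
    return id_prefix(mac_a) == id_prefix(mac_b)


# Bridge addresses of Sw1 and Sw2 taken from the outputs in this post.
CML_22 = ("5254.0000.0f79", "5254.000b.08ff")
CML_23 = ("5254.0000.0000", "5254.0000.0004")

print("CML 2.2.2 collides:", peers_collide(*CML_22))  # False
print("CML 2.3 collides:", peers_collide(*CML_23))    # True
```

With the CML 2.3 addresses, the two switches differ only in the third hextet, so anything that looks at just the first two sees the same device on both ends of the link.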

This also messes with SVIs, because the MAC address of an SVI follows the format of: <first two hextets of base MAC>.8<VLAN ID in hex>. As such, because the first two hextets of every node in the lab are 5254.0000, the MAC address of every VLAN 1 SVI becomes 5254.0000.8001. Every VLAN 2 SVI becomes 5254.0000.8002, and so on, and so forth. This interferes with traffic attempting to traverse SVI to SVI within the same VLAN, as the duplicate MAC addresses break the ARP process.
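The SVI format above can be written out as a short Python sketch. This only models the pattern observed in this post (first two hextets of the base MAC, then 8 followed by the VLAN ID as three hex digits); svi_mac is a name I made up for illustration, not anything from IOS.

```python
def svi_mac(base_mac: str, vlan_id: int) -> str:
    """Derive an SVI MAC as <first two hextets of base MAC>.8<VLAN ID in hex>."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"VLAN ID out of range: {vlan_id}")
    parts = base_mac.lower().split(".")
    if len(parts) != 3:
        raise ValueError(f"not a dotted Cisco MAC: {base_mac!r}")
    # A VLAN ID of at most 4094 (0xffe) always fits in three hex digits.
    return f"{parts[0]}.{parts[1]}.8{vlan_id:03x}"


# CML 2.2.2 style base MACs: the second hextet differs, so the SVI MACs differ.
print(svi_mac("5254.0000.0f79", 1))  # 5254.0000.8001
print(svi_mac("5254.000b.08ff", 1))  # 5254.000b.8001

# CML 2.3 style base MACs: only the third hextet differs, so the SVI MACs match.
print(svi_mac("5254.0000.0000", 1))  # 5254.0000.8001
print(svi_mac("5254.0000.0004", 1))  # 5254.0000.8001
```

The third hextet of the base MAC never makes it into the result, which is exactly why making only that hextet unique does nothing for SVIs.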

Here are some command snippets demonstrating the differences in MAC address generation and how they affect the uniqueness of the MAC addresses associated with SVIs.

CML 2.2.2
 
Sw1#show span | i (Root|Bridge) ID|Address
  Root ID    Priority    32769
             Address     5254.0000.0f79
  Bridge ID  Priority    32769  (priority 32768 sys-id-ext 1)
             Address     5254.0000.0f79
Sw2#show span | i (Root|Bridge) ID|Address
  Root ID    Priority    32769
             Address     5254.0000.0f79
  Bridge ID  Priority    32769  (priority 32768 sys-id-ext 1)
             Address     5254.000b.08ff
 
Sw1#sh int vlan 1 | i bia
  Hardware is Ethernet SVI, address is 5254.0000.8001 (bia 5254.0000.8001)
Sw2#sh int vlan 1 | i bia
  Hardware is Ethernet SVI, address is 5254.000b.8001 (bia 5254.000b.8001)

CML 2.3
 
Sw1#show span | i (Root|Bridge) ID|Address
  Root ID    Priority    32769
             Address     5254.0000.0000
  Bridge ID  Priority    32769  (priority 32768 sys-id-ext 1)
             Address     5254.0000.0000
 
Sw2#show span | i (Root|Bridge) ID|Address
  Root ID    Priority    32769
             Address     5254.0000.0000
  Bridge ID  Priority    32769  (priority 32768 sys-id-ext 1)
             Address     5254.0000.0004
 
Sw1#sh int vlan 1 | i bia
  Hardware is Ethernet SVI, address is 5254.0000.8001 (bia 5254.0000.8001)
Sw2#show int vlan 1 | i bia 
  Hardware is Ethernet SVI, address is 5254.0000.8001 (bia 5254.0000.8001)

The fact that this issue appeared in the same release that switched the CML core from CentOS to Ubuntu is too well-timed to write off as a coincidence. This seems to suggest that the issue happened, at least in part, as a result of unintentional changes in the implementation of the driver stack as part of the transition. This, however, is pure speculation.

Depending on what feature you are using, there are ways to work around this. For SVIs, the solution is to configure unique MAC addresses on each SVI with the mac-address command. It is important to note that they don’t have to be unique between VLANs; they just have to be unique between the SVIs within a given VLAN, as MAC addresses are significant only within the Ethernet domain (defined by the VLAN boundaries). For LACP/PAgP, unfortunately, IOS doesn’t support changing the system/device ID used for these features. As such, your best bet is to force the EtherChannel on statically (channel-group mode on), which skips LACP/PAgP negotiation altogether.
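If you have a lot of switches to fix up, the workaround config can be generated rather than typed. The Python sketch below is only one way to do it, under assumptions of my own: it builds each SVI MAC by moving the third hextet of the base MAC (the part CML 2.3 does keep unique) into the second position, and the interface names and channel-group number are hypothetical placeholders. The mac-address and channel-group ... mode on commands are standard IOS; everything else here is illustrative.

```python
def unique_svi_mac(base_mac: str, vlan_id: int) -> str:
    """Hypothetical scheme: <hextet 1>.<hextet 3 of base MAC>.8<VLAN ID in hex>."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"VLAN ID out of range: {vlan_id}")
    first, _second, third = base_mac.lower().split(".")
    return f"{first}.{third}.8{vlan_id:03x}"


def workaround_config(base_mac, vlan_ids, channel_group, member_ports):
    """Build IOS config lines: one manual MAC per SVI plus a static EtherChannel."""
    lines = []
    for vlan_id in vlan_ids:
        lines.append(f"interface Vlan{vlan_id}")
        lines.append(f" mac-address {unique_svi_mac(base_mac, vlan_id)}")
    for port in member_ports:
        lines.append(f"interface {port}")
        # "mode on" bundles the ports without any LACP/PAgP negotiation.
        lines.append(f" channel-group {channel_group} mode on")
    return "\n".join(lines)


# Example for Sw2 from the CML 2.3 output (base MAC 5254.0000.0004).
# The port names and channel-group number are placeholders for your own lab.
print(workaround_config(
    "5254.0000.0004",
    vlan_ids=[1, 2],
    channel_group=1,
    member_ports=["GigabitEthernet0/1", "GigabitEthernet0/2"],
))
```

Remember that a static EtherChannel has no negotiation to protect you, so both ends need mode on and matching member ports before you bring the links up.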

A permanent solution to this issue would have to come in the form of an image update adding support for vmxnet3 or, alternatively, a fix to the e1000 driver. We will see what happens! 🙂

Kelvin Tran

One comment

  1. Hello Kelvin,

    The change was huge, but needed. I tried to move forward to v2.3, but it turned out to be impossible. I found that I was unable to migrate from v2.2.3, since the cml.controller service was not starting. I also randomly lost links between objects, and of course I would have expected more node types (ACI, UCS, etc.). They advise staying on v2.2.3 because it’s the stable version, but it was worth a try.
    Let’s see if this post, or the one in the CML Community, gathers enough attention for Cisco to correct this behaviour.
    The workaround is quite simple…but…
