Tuesday, August 25, 2015

Linux bridging of VLAN interfaces and bridge IDs

Or, there can be only one

Linux has been offering a decent set of network bridging features for quite some time now. It may not always boast the absolutely latest and greatest of network protocols, but it gets the job done, and does so with commendable stability. I have had literally hundreds (possibly thousands) of Linux software bridges running in rather complex meshed network topologies for extended periods of time with no issues directly related to the implementation.
Basic bridge configuration under Linux is a reasonably simple affair, provided you have some basic understanding of network protocols. Tutorials are available for a number of popular distributions. Delving deeper and handling specific, more complex situations is, however, a bit more demanding. With the advent of virtualization technologies, Linux bridging became increasingly common, as most host (and guest) operating systems had to implement it in some form, in order to enable relatively unimpeded access to the outside network for all involved parties. This resulted in a slew of bug reports (usually against libvirt) and different solutions for some commonly encountered issues, one of which stems from the way the Bridge ID is (re)calculated when individual ports are dynamically added to or removed from the bridge. See, for example, this report.
The bridge ID is an 8-byte value used to uniquely identify a bridge within a network running Spanning Tree Protocol (STP). STP itself is not mandatory, unless you have a meshed topology (or, actually, any topology that can include closed loops). In that case, STP becomes a must, if you want to prevent the dreaded switching loops and accompanying broadcast storms and network outages. The Bridge ID's interpretation has historically developed from 2 bytes of priority followed by 6 bytes holding the numerically lowest MAC address among the bridge's ports into a more complex form, a process nicely illustrated here. Note that the "extension", or, more accurately, reinterpretation of the first 2 bytes of the Bridge ID to include the Extended system ID was made to cater for different VLANs, so that the same physical bridge/port combination would be mapped to a separate bridge ID, depending on the VLAN it is a member of. The Linux bridge control command line utility, brctl, doesn't make the distinction between old and new interpretations at all, instead allowing one to explicitly control all 2 bytes (16 bits) of the priority/extended ID field, using the brctl setbridgeprio syntax. The important thing to note is that, for STP to function correctly (also known as: to keep all of the ugliest demons of hell from coming forth, spawning offspring with the nastiest of the gremlins and then letting them loose on your network), bridge IDs have to be unique! This is why the MAC address is used as a part of the Bridge ID (MACs are guaranteed to be, or at least assumed to be, globally unique), and VLAN IDs were added later to cover the cases where the same port MACs are used to carry the traffic for otherwise distinct VLANs.
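To make the layout a bit more tangible, here is a minimal userspace sketch of the 8-byte bridge ID and the two ways of reading its first 2 bytes. It is not kernel code: the struct and the helper names (make_bridge_id, ext_prio, bridge_id_cmp) are made up for illustration, and the 4-bit priority / 12-bit extension split is the one I know from the STP standard rather than anything brctl enforces.

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Illustrative stand-in for a bridge ID: 2 bytes of priority, 6 bytes of MAC. */
struct bridge_id {
    uint8_t prio[2];
    uint8_t addr[ETH_ALEN];
};

/* Build a bridge ID from a 16-bit priority field and a MAC address. */
static struct bridge_id make_bridge_id(uint16_t prio, const uint8_t mac[ETH_ALEN])
{
    struct bridge_id id;

    id.prio[0] = (uint8_t)(prio >> 8);   /* most significant byte first */
    id.prio[1] = (uint8_t)(prio & 0xff);
    memcpy(id.addr, mac, ETH_ALEN);
    return id;
}

/* Extended system ID reading: top 4 bits are priority, low 12 bits carry the VLAN ID. */
static uint16_t ext_prio(uint16_t prio, uint16_t vlan)
{
    return (uint16_t)((prio & 0xf000) | (vlan & 0x0fff));
}

/* Bridge IDs are compared as 8-byte big-endian numbers; the lower one wins. */
static int bridge_id_cmp(const struct bridge_id *a, const struct bridge_id *b)
{
    return memcmp(a, b, sizeof(*a));
}
```

With the same MAC and a base priority of 0x8000, VLANs 2 and 3 give the priority fields 0x8002 and 0x8003, so the two resulting bridge IDs differ even though their MAC portions are identical.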
The issue with Linux bridging setup (or, at least, its CentOS 6.x implementation) arises when you want to bridge traffic from different VLANs, after the VLANs have been terminated and the tags stripped off. A common use case would be an L2 separation of different traffic groups within one physical Ethernet trunk. The way this is commonly set up under Linux is (shown symmetrically for clarity):

A simple VLAN trunking
The pair of eth0s form a trunk that carries three distinct VLANs (1, 2 and 3, using default naming), but the eth0.x endpoints give us untagged frames belonging only to the appropriate VLANs. This ensures traffic isolation between each of the VLANs, and enables us to do whatever we want with each of the eth0.x endpoints -- give them arbitrary IP addresses, bridge them further, etc. For more examples and a more in-depth explanation of VLAN bridging on Linux, see this article. So far, so good. One thing to note here is that only two distinct, real MACs are in play: those belonging to eth0 interfaces on both systems. All of the eth0.x (virtual) interfaces share the MAC of the physical trunk interface they are derived from. Now, what we might want to do once we have the VLAN trunking configured as above is to dynamically connect some other actual (physical) interfaces through our VLAN pipes:

Passing external traffic through our VLANs
We want to pipe network traffic between ethA and ethX, and between ethB and ethY, through Systems 1 and 2. One way to do it is by bridging ethA and ethX to input/output points for VLAN 2, eth0.2, and doing the same with ethB and ethY via VLAN 3 (eth0.3). While this configuration might be somewhat unusual, the ability of Linux bridges to add and remove individual ports on the fly might be worth the effort if the setup needs to be dynamic in nature. Let's assume that we have to run STP on all bridges (both br2s and br3s), because a loop might be present between ethA/ethX and ethB/ethY pairs, in the topology outside of Systems 1 and 2. Then, if the numerical value of eth0's MAC on either side is less than the corresponding values for ethA and ethB (ethX and ethY for System 2), both br2 and br3 will end up with the same Bridge ID! This happens because the algorithm for bridge ID calculation will go through all the ports that belong to a bridge and pick the MAC with the smallest numerical value as the MAC address portion of the bridge ID. But, in our case, both eth0.2 and eth0.3 ports have the same MAC, copied from the actual eth0 MAC. If it is actually smaller than the other bridge port's MAC, it will be used, leading to multiple bridges with the same Bridge IDs. As far as I know, at least on CentOS 6.x, no automagic effort will be made by the command line tools or system configuration scripts to use the VLAN ID of the bridged ports to configure the Extended system ID of the bridges they are a part of. This can be overcome by issuing appropriate brctl setbridgeprio commands to split bridge IDs, of course. CentOS 6 network configuration scripts don't allow for setting up the priority of bridges, but that ability can be added easily enough, by editing the /etc/sysconfig/network-scripts/ifup-eth file to include something along the lines of:

[ -n "${PRIO}" ] && /usr/sbin/brctl setbridgeprio "${DEVICE}" "${PRIO}"

in the bridge configuration section. After that change, any PRIO=... settings in persistent network configuration files will be applied, whenever the device in question is a bridge. Other distributions may already support bridge priority setting at boot time (well, actually, at network bringup time), or would require a somewhat different solution to add the feature.
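The collision and the workaround can be sketched in a few lines of userspace C. This is only an illustration of the selection rule described above, not the kernel's actual code; lowest_mac and prio_for_vlan are hypothetical helpers, and folding the VLAN ID into the low 12 bits of the priority is just one way to pick distinct PRIO values.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Return the numerically smallest MAC among a bridge's ports, or NULL if there are none. */
static const uint8_t *lowest_mac(const uint8_t macs[][ETH_ALEN], size_t n)
{
    const uint8_t *best = NULL;

    for (size_t i = 0; i < n; i++) {
        /* A byte-wise compare equals a numeric compare for big-endian MACs. */
        if (best == NULL || memcmp(macs[i], best, ETH_ALEN) < 0)
            best = macs[i];
    }
    return best;
}

/* Hypothetical helper: a value for "brctl setbridgeprio" that keeps the top
 * 4 priority bits and puts the VLAN ID into the low 12 bits. */
static unsigned int prio_for_vlan(unsigned int base_prio, unsigned int vlan)
{
    return (base_prio & 0xf000u) | (vlan & 0x0fffu);
}
```

If br2 holds eth0.2 and ethA, br3 holds eth0.3 and ethB, and eth0's MAC is the smallest of the lot, lowest_mac() returns the same address for both bridges. Assuming a base priority of 32768, prio_for_vlan() would then suggest PRIO=32770 for br2 and PRIO=32771 for br3, which is enough to tell the two bridge IDs apart.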
The other issue related to the Bridge IDs and the extended ID field is somewhat more obfuscated. The Kernel version powering the CentOS 6 series (2.6.32) and, quite possibly, the latest 4.x Kernel versions too, have a quirk when it comes to updating STP settings in reaction to a change in one of the bridge's ports. The change in question might be adding or removing the entire port from the bridge, or a change of the port's MAC address while it is a member of a bridge. As already explained, this should trigger a set of recalculations, because the smallest numerical MAC value of all the bridge's ports might have changed. The Kernel includes a function called br_stp_change_bridge_id(). It is provided with the new, smallest, numerical MAC value to apply to the bridge, and then it does two things: first it updates the bridge's own MAC address (correctly), and then goes through all of the bridge's ports, checking to see if the old Bridge ID (prior to the MAC change) was the root and/or designated bridge for any of the ports. If so, it updates those values as well. The piece of code that does the second task is:

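list_for_each_entry(p, &br->port_list, list) {
    /* Only the MAC portion of the designated bridge ID is compared here */
    if (ether_addr_equal(p->designated_bridge.addr, oldaddr))
        memcpy(p->designated_bridge.addr, addr, ETH_ALEN);

    /* Same for the designated root: the priority bytes are ignored */
    if (ether_addr_equal(p->designated_root.addr, oldaddr))
        memcpy(p->designated_root.addr, addr, ETH_ALEN);
}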

Here, oldaddr is the original MAC address and addr is the new address. The ether_addr_equal() function compares the two MAC addresses and returns true if they are equal. Older Kernel versions use !compare_ether_addr() instead, but the effect is the same. Can you spot the problem?
Having concluded that several bridges on the same system can, in fact, share the MAC portion of their Bridge IDs, we can see that the condition for updating the root and/or designated bridge is too broad: comparing the MAC portions alone might lead to false positives. For example, if two bridges with IDs 8000:00:11:22:33:44:55 and 8001:00:11:22:33:44:55 existed on the same system, updating the MAC of the second bridge from 00:11:22:33:44:55 to 00:66:77:88:99:AA would cause any of its ports that had the root and/or designated bridge set to 8000:00:11:22:33:44:55 to change those values to 8000:00:66:77:88:99:AA - a nonexistent bridge. To avoid this, we would have to compare the full Bridge IDs, like so:

bridge_id old_bridge_id;
(...)
memcpy(&old_bridge_id, &br->bridge_id, sizeof(old_bridge_id));
(...)
list_for_each_entry(p, &br->port_list, list) {
    if (!memcmp(&p->designated_bridge, &old_bridge_id, sizeof(old_bridge_id)))
        memcpy(p->designated_bridge.addr, addr, ETH_ALEN);

    if (!memcmp(&p->designated_root, &old_bridge_id, sizeof(old_bridge_id)))
        memcpy(p->designated_root.addr, addr, ETH_ALEN);
}
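To convince myself that the two conditions really behave differently, the comparison can be replayed outside the Kernel. The following is a simplified userspace sketch under my own assumptions: the structs only mimic the relevant fields, and update_mac_only() and update_full_id() are illustrative names, not Kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Simplified stand-ins for the Kernel structures (illustration only). */
struct bridge_id {
    uint8_t prio[2];
    uint8_t addr[ETH_ALEN];
};

struct port {
    struct bridge_id designated_bridge;
};

/* Existing condition: rewrite the MAC if only the MAC portion matches. */
static bool update_mac_only(struct port *p, const uint8_t *oldaddr,
                            const uint8_t *addr)
{
    if (memcmp(p->designated_bridge.addr, oldaddr, ETH_ALEN) != 0)
        return false;
    memcpy(p->designated_bridge.addr, addr, ETH_ALEN);
    return true;
}

/* Proposed condition: rewrite the MAC only if the whole 8-byte ID matches. */
static bool update_full_id(struct port *p, const struct bridge_id *old_id,
                           const uint8_t *addr)
{
    if (memcmp(&p->designated_bridge, old_id, sizeof(*old_id)) != 0)
        return false;
    memcpy(p->designated_bridge.addr, addr, ETH_ALEN);
    return true;
}
```

Taking the example above, a port whose designated bridge is 8000:00:11:22:33:44:55 gets rewritten by update_mac_only() when the bridge with the old ID 8001:00:11:22:33:44:55 changes its MAC, while update_full_id() leaves it alone because the priority bytes differ.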

Whether this is something to be concerned about, I am not sure. Given the scope of Linux network stack deployment in the real world, I guess that even this (very) particular scenario would show up sooner or later, prompting a fix. On the other hand, if you're comfortable with rolling your own bridging Kernel module... it never hurts to be (a bit more) sure.


Thursday, August 20, 2015

Eclipse rendering issues on Fedora 22 Mate

Mate is your best mate in VM Fedora

I've learned the hard way that using modern, 3D-accelerated, effects-blazing desktop environments inside VirtualBox Linux guests is not something one could hope to result in any semblance of speed. I cannot help but wonder if that is a very unusual use-case, because it is seldom covered by any of the mandatory Linux news articles praising the latest iteration of Gnome3, KDE5 or even Cinnamon. Anyway, when I decided to try out the latest Fedora, I was happy to notice the official Mate spin is available for download. As the blurb on the linked page says, Mate aims at high productivity and performance (and the aging, nostalgic, GTK2 crowd), although you can switch to the built-in Compiz window manager to get some bling from it. I would stick to the default, simple, but very usable and fast, Marco window manager. Even with VirtualBox's haphazard support for graphics acceleration in Linux guests, it would work nicely on my trusty i5.
Things went well: Fedora 22 Mate was installed and fully updated inside the bleeding edge VirtualBox (5.0.2), together with guest additions. The Fedora graphical installer still makes no sense whatsoever, unless you were born on a planet where GUI designers always had key confirmation buttons like Ok and Done tucked away in top corners of their forms, but at least it worked ok with all-default settings. Installing Fedora without having to analyze the latest rage in its default partitioning scheme is a major benefit of using a virtual machine. Had this been a bare metal, or, even worse, a not-so-bare metal (i.e. with some OSes already installed on it) installation, I'd have spent much more time on research. I still remember the time when Fedora 2 happily munched up the MBR of my work PC when I was trying to set up a dual-boot with Windows.

So far, so good. The desktop was fast and responsive, and, after the VirtualBox guest additions were added, resized properly on the fly, together with the VM window. I wanted to do some C/C++ development, so after pulling in and compiling some stuff on the command line, I decided to install Eclipse. Now that is one package I cannot help but have a love-hate relationship with. On the one hand, it is hugely powerful, with plugins supporting many programming languages and code sharing and maintenance aspects, flexible remote execution and debugging, etc. It was adopted and adapted for use by many commercial vendors (see, for example, this one). In a lot of ways it is a de-facto standard, especially in the field of cross-platform IDEs. On the other hand, it is sometimes insanely slow, difficult to properly configure, subject to arbitrary vanishing of previously working plugins, can bleed memory like a stuck pig (and lasts about as long, when it happens) and is frustratingly prone to errors in configuration files caused by improper shutdowns. I guess that with great power comes great respons... quirkiness and this is where virtual machines come into play. Once you set up Eclipse (and the rest of the development environment) just the way you like it, you can make a copy of the VM and share it with other developers, knowing that it will work on their boxes without much/any additional begging and hair-tearing.

...or will it?

Fedora 22 comes with Eclipse 4.5 (Mars), a brand new edition that I had never used until now. Much to my dismay, I have discovered that it now grows fur when massaged.
You can read that last sentence again, until it sinks in. I'll wait.
Still not convinced? Ok, let me illustrate the point: 

Furry fonts as featured in Eclipse 4.5 Mars on Fedora 22 Mate

Notice the blurry/bold fonts in the Project Explorer and Outline windows, to the left and right? No, it's not a feature. Well, a furry feature perhaps. Here's the zoomed-in version:

In all of their furry glory

To top it off, the fonts aren't like this (furry) when you start the application. It only happened after I scrolled up and down the Project Explorer window a few times. The more I scrolled, the... furrier they got. Doing something that would repaint the window, like resizing or moving it around, got rid of the effect temporarily.
At this point, the look on my face could have been, pretty accurately, described as o.O -- I am used to various idiosyncrasies of fresh Linux distributions, but I must admit that seeing text rendered and then re-rendered over itself with slight offset when scrolling through a common tree-view control is new. After a while, I sighed and resigned myself to watching the fleeting fun furry fonts until the next update hit. Surely, people have been complaining about this? Onwards and upwards! I went on to install a Mercurial version control plugin for Eclipse. Since it is not, apparently, in any of the official repositories, I would have to add it directly through Eclipse, and it turned out that Eclipse now has a cool new feature called the Marketplace to help with the, sometimes cumbersome, process of adding new plugins directly into Eclipse. The Marketplace, itself, is in the repositories, and it's called eclipse-mpc (really?), so one short command and a few prompts later:

dnf install eclipse-mpc

..we should be in business! Restarting Eclipse and going to Help/Eclipse Marketplace opens up a new marketplace window that promptly proceeds to pull newest plugin data from their servers... and then does nothing.

No plugins listed, not even a single furry font worth of them

Um, ok. Now this is frustrating. I've had my own share of problems with Eclipse accessing the web through firewalls and proxies, but I don't think this is one such case: it seemed to read the pages ok (a progress bar showed it downloading the info), and the old, manual ways of adding plugins still kinda work, even though they need Internet access too.
Thinking it might be a rendering issue (again), since rendering was clearly broken in this version after all, I went on a Google spree, which rewarded me with this gem. Apparently, people had been having similar issues (but not completely the same, check out the gallery of attachments to that bug report for extra hilarity) for more than a year. Now, nowhere does it mention the Mate desktop environment, but the workaround proposed in comment #21 suddenly made a lot of sense. It proposed adding an obscure Eclipse startup setting to the eclipse.ini file (located at /usr/lib64/eclipse/eclipse.ini on 64-bit Fedora 22), which apparently forces the use of GTK version 2 by the launcher:

--launcher.GTK_version
2
 
Mate is GTK2-based, with GTK3 support still experimental, which is why that rang a bell. Honestly, I have no idea if this is definitely related to the choice of Mate as a desktop environment, and I'm not installing the Gnome3 desktop just to see if the case could be repeated there, too. Having said that, I figure people would actually report this a lot, if it were happening in the default Fedora 22 Workstation setup, which uses Gnome3. It also has something to do with the version of Eclipse, since the people who did complain about it in that bugzilla thread mentioned that it came about after upgrading to Eclipse Mars 4.5. It might be that Mars simply changed the default GTK version from 2 to 3, and that broke some setups.
The workaround helped in my case, both with the furry fonts (or lack thereof), and the Marketplace finally shone in all of its glory:

Now with actual items!