by Shane Coughlan & Armijn Hemel
First published September 28, 2009 on LWN.net

This article is an opinion piece and does not contain legal advice. The authors are not lawyers.

This article examines a field called compliance engineering. Compliance engineering was pioneered by technical experts who wanted to address misuses of software, and was made famous by gpl-violations.org, the FSF, and similar organizations correcting Free and Open Source Software (FOSS) license violations. The field has grown into a commercial segment with companies like Black Duck Software and consultancy firms like Loohuis Consulting offering formal services to third parties.

Rather than attempting to examine compliance engineering in all market segments and under all conditions, this article will focus on explaining some of the tools and skills required to undertake due diligence activities related to licensing and binary code in the embedded industry. It is based on the GPL Compliance Engineering Guide, which in turn is based on the experience of engineers contributing to the gpl-violations.org project.

Some of the methods described in this article may not be permitted by the DMCA or similar legislation in certain jurisdictions. It is important to stress that the goal of compliance engineering is not to reverse engineer a product so it can be resold for monetary gain, but rather to apply digital forensics to see if copyright was violated. You should consult a lawyer to find out the legal status of the engineering methods described here.

Context and confusion

The first phase of compliance engineering is not engineering. It is about understanding the license that applies to code and understanding what that means with regard to obligations in a particular market segment. This dry art is sometimes challenging because of the culture of FOSS. FOSS has an innovative, fast moving, and diverse ecosystem. Contributors tend to be passionate about their work and about how it is released, shared, and further improved by the community as a whole. This can be something of a double-edged sword, providing exceptional engagement and occasionally an overabundance of enthusiasm in areas like software licensing or compliance.

The gpl-violations.org project enforces the copyright of Harald Welte and other Linux kernel developers, and has a mechanism for third parties to report suspected issues with use of Linux and related GPL code. One of the most common false positives reported is that companies are violating the GNU GPL version 2 by providing a binary firmware release for embedded devices without shipping source code in the package or offering it on a website for download. This highlights a misunderstanding regarding what the GPL requires. It is true that the GPL comes into effect when distributing code and that offering a binary firmware for download is distribution, but compliance with the license terms is more subtle than it may appear to parties who have not read the license carefully.

In the GPLv2 license there is no requirement for source code to be provided in the product package or on a website to ensure compliance. Instead, sections 3a and 3b of the GPLv2 give people distributing binary versions of licensed software two options regarding source code. One is to accompany the product with the source code, and the other is to include a written offer, valid for at least three years, to supply the source code to any third party. When someone gets a device with GPLv2 code and wants to check compliance, they need to look for accompanying source or a written offer in the manual, on the box, on a separate leaflet, in web interface menus, and in any interactive menus.

It gets a little more complex when you consider that the above constitutes only the terms applying to source code. Finding source code or a written offer for it does not constitute full GPLv2 compliance. Instead, compliance depends on whether the offered source code is complete and corresponds precisely to what is on the product, whether the product also shipped with a copy of the license, and what else is shipped in what way alongside the GPL code. The full text of the license spells out how the parameters of this relationship work.

Compliance engineering is an activity that requires a mixture of technical and legal skills. Practitioners have to identify false positives and negatives, and to contextualize their analysis within applicable jurisdictional constraints. This can appear daunting for parties who have a casual approach to reading licenses. However, the skills and tools applied are relatively simple as long as a balanced approach is taken when understanding what is explicitly required in a license and what is actually present in a product. Given these two skills anyone can help make sure that people who use GPL or other FOSS licenses are adhering to the terms the copyright holders selected.

The nuts and bolts

Compliance engineers in organizations like gpl-violations.org do not have an extensive toolset. In the embedded market the product from a software perspective is a firmware image, and this is just a compilation of binary code. The contents may include everything needed to power an embedded device (bootloader, plus operating system) or just updates to certain parts of the embedded device software.

Checking if firmware is meeting the terms of a license like the GPLv2 requires the application of knowledge and a sequence of tests such as extracting visible strings from binary files and correlating them to source code. One aspect is identifying GPL software components and making sure they are included in source releases, and another requires opening the device to get physical access to serial ports. The only essential tools required are a Linux machine, a good editor, binutils, util-linux, and the ability to mount file systems over loopback or tools like unsquashfs to unpack file systems to disk.
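The string-correlation test mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used by gpl-violations.org: the function names are made up for this example, and the literal extraction deliberately ignores C string literals that contain escape sequences.

```python
import re


def printable_strings(blob, min_len=6):
    """Return the set of printable-ASCII runs of at least min_len bytes,
    roughly what the strings tool reports for a binary."""
    return set(re.findall(rb"[\x20-\x7e]{%d,}" % min_len, blob))


def source_literals(source, min_len=6):
    """Return double-quoted literals from C source text.
    Simplification: literals containing a backslash are skipped."""
    return set(re.findall(rb'"([^"\\\n]{%d,})"' % min_len, source))


def correlate(binary, source, min_len=6):
    """Return the strings found in the binary that contain a source literal."""
    literals = source_literals(source, min_len)
    return sorted(
        s for s in printable_strings(binary, min_len)
        if any(lit in s for lit in literals)
    )
```

A long, distinctive literal that shows up in both the binary and a candidate source file is a useful hint, although a handful of matches is only a starting point for closer inspection.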

Opening firmware

The most common operating systems for embedded devices today are Linux-kernel based or VxWorks. There are a few specialized operating systems and variants of BSD available in the market, but they are becoming less common. Linux-based firmware nearly always contains the kernel itself, one or more file systems, and sometimes a bootloader.

The quickest way to find file systems or kernels in a firmware is to search for padding. Padding usually consists of filler bytes such as zeroes which fill up space. This ensures that the individual components of a firmware are at the right offsets. The bootloader uses these offsets to quickly jump to the location of the kernel or a file system. Therefore if you see padding there will either be something following it, or it marks the end of the file. Once you have identified the components you will know what type of firmware you are dealing with, what’s in there on the architecture level, and (with a little bit of experience) what’s likely to be problematic with regard to complete source code releases.
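The search for padding can be automated. The sketch below looks for long runs of a single filler byte and reports where each run starts and ends; the end of a run is a good place to look for the next component. The function name, the default minimum run length of 64 bytes, and the choice of zero as the filler are assumptions for illustration (some images pad with 0xff instead).

```python
import re


def padding_runs(blob, min_len=64, pad=b"\x00"):
    """Yield (start, end) offsets for each run of the pad byte that is
    at least min_len bytes long. 'end' is the offset of the first byte
    after the run."""
    pattern = re.compile(re.escape(pad) + b"{%d,}" % min_len)
    for match in pattern.finditer(blob):
        yield match.start(), match.end()
```

Reading the whole image into memory, for example with `open(path, "rb").read()`, is fine for firmware of a few megabytes.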

If you can’t find any padding in the firmware then another method is to look for strings like “done, booting the kernel”, as these indicate that something else will follow immediately afterwards. This method is a little trickier and involves things like searching for markers that indicate compression (gzip header, bzip2 header, etc.), a file system (squashfs header, cramfs header, etc.), and so on. The quickest way to do this is to use hexdump -C and search for headers. Detailed information about headers is already available on most Linux systems in /usr/share/magic.
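Searching for headers by eye in hexdump output works, but the same search is easy to script. The following sketch scans a firmware image for a few well-known signatures. The table is intentionally small and the names are illustrative; short signatures such as the three-byte bzip2 marker will produce false positives in large images, so every hit still needs to be checked by hand.

```python
# Partial signature table; extend it with entries from the magic database.
MAGICS = {
    b"\x1f\x8b\x08": "gzip (deflate)",
    b"BZh": "bzip2",
    b"hsqs": "SquashFS, little endian",
    b"sqsh": "SquashFS, big endian",
}


def find_magics(blob):
    """Return a sorted list of (offset, description) for every signature hit."""
    hits = []
    for magic, name in MAGICS.items():
        pos = blob.find(magic)
        while pos != -1:
            hits.append((pos, name))
            pos = blob.find(magic, pos + 1)
    return sorted(hits)
```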

Problems you can encounter

The techniques employed for compliance engineering are essentially the same as those employed for debugging an embedded system. This means the basic knowledge is easy to obtain, but it also means that issues can arise when the tools you are attempting to apply differ from the tools used for designing and building the system in the first place:

  • Encryption: Some devices have a firmware image that is encrypted. The bootloader decrypts it during boot time with a key that is stored in the device. Unless you know the decryption key it is impossible to take these devices apart by looking at the firmware only. Examples are ADSL modem/routers which are based on the Broadcom bcm63xx chipset. There are also companies that encrypt their firmware images using a simple XOR. It is often quite easy to find these if you see patterns that repeat themselves very often.
  • Code changes: Sometimes slight changes were made to the file system code in the kernel, which make it hard or even impossible to mount a file system over loopback without adapting a kernel driver. Examples include Broadcom bcm63xx-based devices and devices based on the Texas Instruments AR7 chipset, which both use SquashFS implementations with some modifications to either the LZMA compression (AR7) or the file system code.

To explore what code is present in these cases you need network access or even physical access to the device.
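For the simple XOR case there is a reason the repeating patterns are so telling: XORing a zero byte with a key byte yields the key byte, so every zero-padded region of the image ends up containing the key itself, over and over. The sketch below exploits that. It assumes the key length is known or guessed (here passed as `key_len`) and that the key is applied starting at offset zero; both are assumptions, and the function names are made up for this example.

```python
from collections import Counter
from itertools import cycle


def likely_xor_key(blob, key_len):
    """Return (block, share): the most frequent aligned key_len-byte block
    and the fraction of all blocks it accounts for."""
    blocks = [blob[i:i + key_len]
              for i in range(0, len(blob) - key_len + 1, key_len)]
    if not blocks:
        raise ValueError("image is shorter than the key length")
    block, count = Counter(blocks).most_common(1)[0]
    return block, count / len(blocks)


def xor(blob, key):
    """XOR every byte of blob with the repeating key."""
    return bytes(b ^ k for b, k in zip(blob, cycle(key)))
```

If one block dominates the image, applying `xor` with that block and then looking for known headers in the result is a quick way to confirm or reject the guess.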

Network scanning

With port scanners like nmap you can make a fairly accurate guesstimate of what a certain device is running by using fingerprinting: many network stacks respond slightly differently to different network packets. While a fingerprint is not enough to use as evidence, scanning can give you useful information, like which TCP ports are open and which services are running. Surprisingly often you can still find a running telnet daemon which will give you direct access to the device. Sometimes exploiting bugs in the web interface also allows you to download or transfer individual files or even the whole (decrypted) file system.
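The simplest part of this, checking which TCP ports accept connections, needs nothing more than the standard socket library. The sketch below is not a replacement for nmap and does no fingerprinting; it only attempts a plain TCP connect to each port in a list. The function name and the one-second timeout are arbitrary choices for this example.

```python
import socket


def open_tcp_ports(host, ports, timeout=1.0):
    """Return the subset of ports on host that accept a TCP connection."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success and an errno value otherwise.
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found
```

A call such as `open_tcp_ports("192.168.1.1", [21, 22, 23, 80])` would show whether a telnet daemon is listening, assuming the device sits at that address on your own network.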

Physical access

Most embedded devices have a serial port, and this is sometimes the only way to find violations. The port may not be visible, and sometimes it is present only as a series of solder pads on the internal board. After adding pin headers you can connect a serial port to the device and (perhaps with the addition of a voltage level shifter) attach the device to a PC. Projects like OpenWrt have a lot of hardware information on their website and this can be useful in working out how to start.

Once physical access is granted things get easier. The bootloader is usually configured to be accessible via the serial port for maintenance work such as uploading a new firmware, and this often translates into a shell starting via the serial port after device initialization. Many devices are shipped with GPL licensed bootloaders, such as RedBoot, u-boot, and others. The bootloader often comes preloaded on a device and is not included in firmware updates because the firmware update only overwrites parts of the flash and leaves the bootloader alone. More problematically, the bootloader may not be included in the source packages released by the vendor, as they overlook its status as GPL code.

Example: OpenWrt firmware

GPL compliance engineering is best demonstrated using a concrete example. In this example we will take apart a firmware from the OpenWrt project. OpenWrt is a project that makes a kit to build alternative firmwares for routers and some storage devices. There are prebuilt firmwares (as well as sources) available for download from the OpenWrt website. In this example we have taken firmware 8.09.1 for a generic brcm47xx device (openwrt-brcm47xx-squashfs.trx).

Running the strings command on the file seems to return random bytes, but if you look a bit deeper there is structure. The hexdump tool has a few options which come in really handy, such as -C, which displays the hexadecimal offset within the file, the bytes in hexadecimal notation, and the ASCII representation of those bytes, if available.

A trained eye will spot that at hex offset 0x001c there is the start of a gzip header, starting with the hex values 0x1f 0x8b 0x08:

    $ hexdump -C openwrt-brcm47xx-squashfs.trx
    00000000  48 44 52 30 00 10 22 00  28 fa 8b 1c 00 00 01 00  |HDR0..".(.......|
    00000010  1c 00 00 00 0c 09 00 00  00 d4 0b 00 1f 8b 08 00  |................|
    00000020  00 00 00 00 02 03 8d 57  5d 68 1c d7 15 fe e6 ce  |.......W]h......|
    ...
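The 28 bytes in front of the gzip header are not random either. They start with the marker HDR0, and the rest can be read as a small container header that lists where each component begins. The sketch below decodes it under an assumed layout (magic, total length, CRC32, flags, version, then three offsets, all little endian); that layout comes from commonly used TRX tooling rather than from this article, so treat the field names as assumptions.

```python
import struct

# Assumed TRX header layout: 4-byte magic, u32 length, u32 crc32,
# u16 flags, u16 version, three u32 component offsets (28 bytes total).
TRX_HEADER = struct.Struct("<4sIIHH3I")


def parse_trx_header(blob):
    """Decode the first 28 bytes of a TRX image into a dictionary."""
    magic, length, crc32, flags, version, *offsets = TRX_HEADER.unpack_from(blob)
    if magic != b"HDR0":
        raise ValueError("not a TRX image")
    return {
        "length": length,
        "crc32": crc32,
        "flags": flags,
        "version": version,
        "offsets": offsets,
    }
```

Applied to the bytes shown in the dump, the three offsets decode to 0x1c, 0x90c and 0xbd400, which are exactly the places where the gzip data, the LZMA compressed kernel and the SquashFS file system turn up in this example.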

Extracting can be done using an editor, or more easily with dd:

    $ dd if=openwrt-brcm47xx-squashfs.trx of=tmpfile bs=4 skip=7

This command reads the file openwrt-brcm47xx-squashfs.trx and outputs it to another file, skipping the first 28 bytes.
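The skip value is counted in blocks of bs bytes, not in bytes, so the byte offset has to be divided by the block size and must divide evenly. The small helper below does that arithmetic and also shows the equivalent slice in Python; both function names are made up for this sketch.

```python
def dd_skip(offset, block_size):
    """Return the dd skip= value that reaches byte offset with bs=block_size."""
    if offset % block_size:
        raise ValueError("offset is not a multiple of the block size")
    return offset // block_size


def carve(blob, offset, length=None):
    """Return blob from offset onward, optionally limited to length bytes."""
    end = None if length is None else offset + length
    return blob[offset:end]
```

For the gzip header at 0x1c, `dd_skip(0x1c, 4)` gives 7, which is the value used in the command above.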

    $ file tmpfile
    tmpfile: gzip compressed data, from Unix, max compression

With zcat this file can be uncompressed to standard output and redirected to another file:

    $ zcat tmpfile > foo

The result in this particular case is not a Linux kernel image or a file system, but the LZMA loader used to uncompress the LZMA compressed kernel that is used by OpenWrt. LZMA does not always use the same headers for compressed files, which makes it quite easy to miss. In this case the LZMA compressed kernel can be found at offset 0x090c.

    $ dd if=openwrt-brcm47xx-squashfs.trx of=kernel.lzma bs=4 skip=579

Unpacking the kernel can be done using the lzma tool.

    $ lzma -cd kernel.lzma > bar

Running the strings tool on the result quite clearly shows strings from the Linux kernel.
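A quick way to confirm this without reading through the full strings output is to look for the kernel's version banner, which begins with the text "Linux version". The sketch below is a minimal check along those lines; the function name is made up, and the banner is assumed to consist of printable ASCII up to the first non-printable byte.

```python
import re


def kernel_banners(blob):
    """Return every 'Linux version ...' banner found in blob as text."""
    pattern = re.compile(rb"Linux version [\x20-\x7e]+")
    return [match.group().decode("ascii") for match in pattern.finditer(blob)]
```

Running it over the unpacked file, for example `kernel_banners(open("bar", "rb").read())`, should list the kernel version along with the build host and compiler details that the banner normally carries.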

In openwrt-brcm47xx-squashfs.trx you can see padding in action around hex offset 0x0bd280, immediately followed by a header for a little endian SquashFS file system.

    $ hexdump -C openwrt-brcm47xx-squashfs.trx
    ...
    000bd270  1d 09 36 96 85 67 df 8f  1b 25 ff c0 f8 ed 90 00  |..6..g...%......|
    000bd280  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    *
    000bd400  68 73 71 73 9b 02 00 00  00 c6 e1 e2 d1 2a 00 00  |hsqs.........*..|
    ...

    $ dd if=openwrt-brcm47xx-squashfs.trx of=squashfs bs=16 skip=48448

From just the header of the file system it is not obvious which compression method is used:

    $ file squashfs
    squashfs: Squashfs filesystem, little endian, version 3.0, 1322493 bytes,
    667 inodes, blocksize: 65536 bytes, created: Tue Jun  2 01:40:40 2009
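Part of what file reports can be read straight from the first bytes of the file system. The sketch below takes the byte order from the magic (hsqs for little endian, sqsh for big endian) and then reads the 32-bit field that follows it as the inode count. That field position is an assumption, although it agrees with the 667 inodes reported above for this image; the compression method is not covered by this sketch.

```python
import struct


def squashfs_summary(blob, offset=0):
    """Return (byte_order, inode_count) for a SquashFS header at offset."""
    magic = blob[offset:offset + 4]
    order = {b"hsqs": "<", b"sqsh": ">"}.get(magic)
    if order is None:
        raise ValueError("no SquashFS magic at this offset")
    # Assumed: the u32 directly after the magic holds the inode count.
    (inodes,) = struct.unpack_from(order + "I", blob, offset + 4)
    return ("little" if order == "<" else "big", inodes)
```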

The two most used compression techniques are zlib and LZMA, with the latter quickly becoming more popular. Unpacking with the unsquashfs tool will give an error:

    zlib::uncompress failed, unknown error -3

This suggests that LZMA compression is probably used instead of zlib. Unpacking requires a version of unsquashfs that can handle LZMA. The OpenWrt source distribution contains all the configuration and build scripts needed to build a version of unsquashfs with LZMA support fairly easily.

The OpenWrt example is fairly typical for real cases that are handled by gpl-violations.org, where unpacking the firmware is usually the step that takes the least effort, often just taking a few minutes. Matching the binary files to sources and correct configuration information and verifying that the sources and binaries match is a process that takes a lot more time.

In conclusion

Compliance engineering is a demanding and occasionally tedious aspect of the software field. Emotion has little place in the analysis applied and the rewards of volunteer work are not visible to most people. Yet compliance engineering is also essential, providing as it does a clear imperative for people to obey the terms of FOSS licenses. It contributes part of the certainty and stability necessary for diverse stakeholders to work together on common code, and it allows a clear mechanism for discovering which parties are misunderstanding their obligations as part of the broader ecosystem. Transactions between individuals, projects and businesses cannot be sustained without such mechanisms.

It is important to remember that the skills involved in compliance engineering are not necessarily limited to a small subset of consultants and companies. Documents like the GPL Compliance Engineering Guide describe how to dig through binary code suspected of issues. Engineers from all aspects of FOSS can contribute assistance to a project or business when it comes to forensic analysis or due diligence, and they can report any issues discovered to the copyright holders or to entities like the FSF’s Free Software Licensing and Compliance Lab, gpl-violations.org, FSFE’s Freedom Task Force, and the Software Freedom Law Center.

About the authors

Armijn Hemel is a technology consultant with Loohuis Consulting in The Netherlands and the primary engineer for the gpl-violations.org project.

Shane Coughlan is a business and technology consultant with Opendawn in Japan. He is an expert in Free/Open Source Software licensing, standardization, communication methods and business development.