The myth of software language portability

For many open-source software projects, C is the programming language of choice. While there are occasionally good reasons for choosing C over another, higher-level programming language, many projects seem to use C because it is supposedly more “portable” than other programming languages.

The portability of C programs is often cited as one of its virtues as a development target; however, I will make a case that, upon closer inspection, C is actually not a very portable programming language at all.

For this article, I will follow Wikipedia in defining software portability as, “[the ability] to reuse the existing code instead of creating new code when moving software from [one] environment to another” (bracketed edits mine). In other words, a program is portable if you can run the same code without modification on a variety of different hardware and operating system platforms.

On the surface, C programs appear to be extremely portable. C compilers are available for practically any combination of hardware and operating system you can imagine, and some of these are of extremely good quality (e.g., GCC and LLVM, to name two).

Thanks to tools like autoconf, it seems possible to compile and install many large software packages written in C without having to modify a single line of source code.

Upon closer inspection, however, you will see that these tools give only the illusion of portability; under the covers, you are in fact compiling different code depending upon details of your platform. While it’s true that you are not modifying the code yourself, modifications—and sometimes substantial ones—are taking place quietly behind the scenes. 

Compiling and installing a typical C program from source involves typing (or having a package management program type for you) commands similar to these:

    # Download the package
    % wget http://package.location.com/path/to/tarball.tar.gz

    # Unpack the source
    % tar -xzf tarball.tar.gz

    # Configure
    % cd tarball/
    % ./configure

    # Compile
    % make

    # Install
    % make install

In a typical package, configure is a shell script generated by autoconf that roots around in your system to figure out which variations of the code should actually be compiled. This is necessary for a couple of reasons; for one, different platforms may assign different meanings to a given C expression. For a trivial example, consider the following C declarations:

    int x;
    long y;

According to the ANSI C standard, you can be sure that x can accept any integer value in the closed interval [-32767, 32767], and that y can hold any value x can, but you cannot be certain of the exact sizes of these two variables from their declarations alone.
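
To see this concretely, here is a minimal program (mine, not from any particular package) that reports what the local platform actually chose:

    #include <stdio.h>
    #include <limits.h>

    int main(void) {
        /* Both values are implementation-defined. For example, long is
           4 bytes under 64-bit Windows (LLP64) but 8 bytes under 64-bit
           Linux (LP64), even when both binaries come from GCC. */
        printf("sizeof(int)  = %zu, INT_MAX  = %d\n",  sizeof(int),  INT_MAX);
        printf("sizeof(long) = %zu, LONG_MAX = %ld\n", sizeof(long), LONG_MAX);
        return 0;
    }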

A program that cares about the exact sizes and value ranges for these data types must therefore be modified to suit the local requirements. Typically, this is done using the C preprocessor: The configure script may spit out a config.h file containing, say,

    typedef int my_int_type;
    typedef long my_long_type;

The rest of the program uses my_int_type and my_long_type instead of int and long, and obtains the above definitions via the preprocessor directive

    #include "config.h"

In short: You are compiling different code depending upon how your platform defines its basic types. The fact that the preprocessor makes the changes instead of the end user is irrelevant; the important truth is that the input to the compiler is different.

This problem is not limited to type declarations, either! Beyond the basic file I/O operations provided by the C standard library, C programs must cope with differences in the operating system interface in order to access the network, draw on the screen, play sounds, create threads, and perform a host of other interesting tasks.
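
For instance, something as common as starting a thread already forces a split along platform lines. The sketch below is illustrative (the wrapper name start_worker is my invention, not any library’s API), but the shape is typical:

    #ifdef _WIN32
    #include <windows.h>
    static DWORD WINAPI worker(LPVOID arg) { (void)arg; return 0; }
    static int start_worker(void) {
        /* Win32: CreateThread, with its own handle and status types. */
        HANDLE h = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        return (h != NULL) ? 0 : -1;
    }
    #else
    #include <pthread.h>
    static void *worker(void *arg) { (void)arg; return NULL; }
    static int start_worker(void) {
        /* POSIX: pthread_create, with a different signature entirely. */
        pthread_t t;
        return pthread_create(&t, NULL, worker, NULL);
    }
    #endif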

POSIX helps a bit, by providing (somewhat) standard interfaces for other system services, but in my experience, the degree of POSIX support is inconsistent across systems. Many platforms provide most of the POSIX.1 Core Services suite, but there are plenty of compatibility problems.

As a result, configuration scripts usually wind up generating definitions like this:

    #define HAS_FEATURE_X
    #define NO_FEATURE_Y
    #define FEATURE_Z_VERSION 2

This seems innocent enough, until you look at what happens in the rest of the program: Throughout the code, you will find sections like this:

    #ifdef HAS_FEATURE_X
    /* (A) … some code using feature X … */
    #else
    /* (B) … some gross workaround using other features … */
    #endif

Once again, we see that the compiler is given different code, based on platform differences. You might think you’re compiling the same file on different systems, but you’re not! On systems where Feature X is supported, the compiler gets section (A) of the code, and elsewhere it gets section (B).

You might as well just have two different copies of the file, one with (A) and the other with (B). Sure, the version with the preprocessor takes up less space on disk, but the outcome is the same: You’re compiling different code on systems with Feature X than on systems without it.
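
To make the pattern concrete: HAVE_STRLCPY is a conventional autoconf-style guard for strlcpy(), which ships in the BSD C libraries but has historically been missing from glibc. Here is a sketch of the resulting two-programs-in-one-file situation (the fallback and the name copy_string are mine, not from any real project):

    #include <string.h>

    #ifdef HAVE_STRLCPY
    /* (A) The platform's libc provides strlcpy(). */
    #define copy_string(dst, src, size) strlcpy((dst), (src), (size))
    #else
    /* (B) The workaround: an illustrative re-implementation. */
    static size_t copy_string(char *dst, const char *src, size_t size)
    {
        size_t len = strlen(src);
        if (size > 0) {
            size_t n = (len < size - 1) ? len : size - 1;
            memcpy(dst, src, n);
            dst[n] = '\0';
        }
        return len;  /* like strlcpy: the length it tried to copy */
    }
    #endif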

You might think this distinction doesn’t matter. After all, if I can compile the program without having to make changes myself, does it really matter if the code is the same? I argue that it does matter, for the following reasons:

- Finding and fixing bugs is much more difficult in code that varies by platform. A test case that manifests the bug on one platform may not reveal any errors on another. Since distributed development is very common in open-source projects, this is a real issue. If users can’t easily report bugs, or developers can’t easily reproduce the bugs users have reported, the project may suffer.

- Auditing code for security vulnerabilities is virtually impossible. Does the program contain buffer overflows? Does it leak sensitive data, or violate its security invariants? The only way to know is to generate every possible variation of the complete program and audit each one separately. Automating this process is blocked, because some of the code will only compile on some platforms; so even if you have a tool that can find problems, where do you run it?

- Maintenance can easily cause variations to get out of sync. Suppose you want to add a feature to your program, but it requires editing different blocks of code for different platforms. Can you be sure that each variant is updated correctly, and in a timely fashion? Are your bug-fixes applied consistently across all the platform variations your program includes? Answering these questions takes up a great deal of time and energy within many open-source development projects.

It would be reasonable to conclude, from the above examples, that the real portability problem isn’t caused by C, but by the C preprocessor. Indeed, if you were to write a C program without using the preprocessor at all, the odds are good that it would be quite portable.

Such a program would also be virtually useless, however, since even the most basic features of the standard C library (such as file I/O) require inclusion of library headers, whose exact definitions may (and do) vary by platform. Even among different installations of a single compiler such as GCC, the contents of the headers change across operating systems.
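
Even the canonical first C program depends on the preprocessor, and therefore on platform-varying header contents:

    /* The #include below is a preprocessor directive: it splices in a
       stdio.h whose text differs from one platform to the next. */
    #include <stdio.h>

    int main(void) {
        puts("hello, world");
        return 0;
    }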

More importantly, the C preprocessor is part of the language; it’s specified (though mostly by example) in the ANSI C standard, and is an essential part of the C translation process. You can’t actually have C without it, and all the attendant portability issues it causes.

If you’ve read this far, I hope you are now alive to the idea that C programs, despite appearances, are rarely ever actually “portable” in the sense that they run “without modification in multiple environments.” I do not intend, by this, to argue that you shouldn’t program in C; there are some problem domains for which it’s a very good choice.

However, if you are planning to develop in C because you believe it is “portable”, I hope that you will now reconsider your thinking on the matter. The unportability of C programs is not obvious, but it is an important truth.

So how can we fix it? Well, the easy solution for new projects is to pick a different language. I have opinions on that too, but it’s a matter for another post. Actually “fixing” C would be an enormously difficult undertaking; that said, I think most of the portability problems with C arise because the language specifies too little: The sizes and formats of basic types, structures, unions, and pointer semantics are all underspecified. There is no module system, no way to resolve namespace conflicts except by rewriting code.

Preprocessor conditionals and macro expansion do not respect syntactic or semantic boundaries. The standard library is extremely primitive. There is no standard ABI, linkage format, or function call protocol (although there are some pretty stable conventions). The interpretation of declaration keywords like const, register, volatile, and restrict is so loose that you cannot predict what the compiler will do with them.
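
Structure layout is a handy illustration of this underspecification; the program below (a minimal demonstration of mine, with an invented struct) prints a different answer on different platforms:

    #include <stdio.h>

    struct sample {
        char c;  /* 1 byte */
        long n;  /* implementation-defined size and alignment */
    };

    int main(void) {
        /* The compiler may insert padding between c and n; the total is
           commonly 8 on ILP32 platforms and 16 on LP64 platforms. */
        printf("sizeof(struct sample) = %zu\n", sizeof(struct sample));
        return 0;
    }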

The portability of C would be greatly improved if the standard simply became more opinionated; if, as in Java, the sizes and formats of the primitive data types were clearly defined, the layout, packing, and alignment of structures specified, the preprocessor greatly restricted, and the rules for implicit type conversions simplified. Doing this now, however, would disrupt decades worth of code that was written to work around a poor standard, and would probably cause more troubles than it would solve.
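
It is worth noting that C99’s <stdint.h> already moves partway in this direction: it defines exact-width integer types, though they are optional on implementations that cannot support them, and it does nothing about structure layout or the preprocessor:

    #include <stdint.h>

    int32_t  x;  /* exactly 32 bits, two's complement, where provided */
    uint64_t y;  /* exactly 64 bits, unsigned, where provided */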

Nevertheless, the take-away message here is that a program whose compiler sees different code, depending upon the environment in which it runs, is not portable. And, if you feel—as I do—that portability is a virtue worth consideration, this should concern you.