It is "really hard" to make it actually useful
Let's look at an example: even though invalid type casting bugs are frequently exposed by type-based alias analysis, it would not be useful to produce a warning that "the optimizer is assuming that P and P[i] don't alias" when optimizing "zero_array" (from Part #1 of our series).
float *P;
void zero_array() {
    int i;
    for (i = 0; i < 10000; ++i)
        P[i] = 0.0f;
}
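To make the assumption concrete, here is a hand-written sketch of roughly what the optimizer is entitled to do with this loop. Because a store to a float is assumed not to modify a float*, the load of P can be done once, outside the loop. The names zero_array_hoisted and base are illustrative only and do not come from the original example.

```c
float *P;

/* Hand-written equivalent of the loop after the load of P is hoisted.
   This is only valid under the assumption that storing to P[i] (a float)
   cannot change P itself (a float *). */
void zero_array_hoisted(void) {
    float *base = P;                 /* load P once, before the loop */
    for (int i = 0; i < 10000; ++i)
        base[i] = 0.0f;              /* no reload of P on each iteration */
}
```

A warning about this transformation would fire on perfectly correct code, which is why it would not be useful.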
It is hard to generate these warnings only when people want them
Clang implements numerous warnings for simple and obvious cases of undefined behavior, such as out-of-range shifts like "x << 421". You might think that this is a simple and obvious thing, but it turns out that this is hard, because people don't want to get warnings about undefined behavior in dead code (see also the duplicates).
This dead code can take several forms, such as a macro that expands in a funny way when passed a constant. We've even had complaints that we warn in cases that would require control flow analysis of switch statements to prove that the cases are not reachable. This is not helped by the fact that switch statements in C are not necessarily properly structured.
The solution to this in Clang is a growing infrastructure for handling "runtime behavior" warnings, along with code to prune these out so that they are not reported if we later find out that the block is unexecutable. This is something of an arms race with programmers though, because there are always idioms that we don't anticipate, and doing this sort of thing in the frontend means that it doesn't catch every case people would want it to catch.
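As an illustration of the "macro passed a constant" case, consider the following sketch. ROTL32 and the two wrapper functions are hypothetical names chosen for this example. When the macro is expanded with a count of 0, the expansion contains a shift by 32, which would be undefined if executed, but it sits in the arm of the conditional that is never taken.

```c
#include <stdint.h>

/* Rotate left by n bits. The n == 0 guard exists precisely so that the
   "x >> 32" case is never evaluated. */
#define ROTL32(x, n) \
    ((n) == 0 ? (uint32_t)(x) \
              : (((uint32_t)(x) << (n)) | ((uint32_t)(x) >> (32 - (n)))))

/* Expands to code containing "v >> 32", but only in the dead arm. */
uint32_t rotl_by_zero(uint32_t v) { return ROTL32(v, 0); }

/* Ordinary use: both shift counts are in range. */
uint32_t rotl_by_eight(uint32_t v) { return ROTL32(v, 8); }
```

A warning on rotl_by_zero would be technically about an out-of-range shift, yet the programmer has already guarded against it, so most people would consider it noise.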
Explaining a series of optimizations that exposed an opportunity
If the frontend has challenges producing good warnings, perhaps we can generate them from the optimizer instead! The biggest problem with producing a useful warning here is one of data tracking. A compiler optimizer includes dozens of optimization passes that each change the code as it comes through to canonicalize it or (hopefully) make it run faster. If the inliner decides to inline a function, this may expose other opportunities for optimizing away an "X*2/2", for example.
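A small sketch of how inlining can expose such an opportunity (the helper names twice, half, and round_trip are made up for this example): neither helper contains "X*2/2" on its own, but once both are inlined the optimizer sees exactly that expression and, because signed overflow is undefined, may simplify it to X.

```c
/* Neither helper looks suspicious in isolation. */
static inline int twice(int x) { return x * 2; }
static inline int half(int x)  { return x / 2; }

/* After inlining, the body is effectively "x * 2 / 2". An optimizer may
   fold this to "x", relying on signed overflow being undefined. */
int round_trip(int x) {
    return half(twice(x));
}
```

By the time the simplification happens, the expression no longer corresponds to anything the programmer wrote in one place.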
While I've given relatively simple and self-contained examples to demonstrate these optimizations, most of the cases where they kick in are in code coming from macro instantiation, inlining, and other abstraction-elimination activities the compiler performs. The reality is that humans don't commonly write such silly things directly. For warnings, this means that in order to relay the issue back to the user's code, the warning would have to reconstruct exactly how the compiler got the intermediate code it is working on. We'd need the ability to say something like:
warning: after 3 levels of inlining (potentially across files with Link Time Optimization), some common subexpression elimination, after hoisting this thing out of a loop and proving that these 13 pointers don't alias, we found a case where you're doing something undefined. This could either be because there is a bug in your code, or because you have macros and inlining and the invalid code is dynamically unreachable but we can't prove that it is dead.
Ultimately, undefined behavior is valuable to the optimizer because it is saying "this operation is invalid - you can assume it never happens". In a case like "*P" this gives the optimizer the ability to reason that P cannot be NULL. In a case like "*NULL" (say, after some constant propagation and inlining), this allows the optimizer to know that the code must not be reachable. The important wrinkle here is that, because it cannot solve the halting problem, the compiler cannot know whether code is actually dead (as the C standard says it must be) or whether it is a bug that was exposed after a (potentially long) series of optimizations. Because there isn't a generally good way to distinguish the two, almost all of the warnings produced would be false positives (noise).
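The "*P" case can be sketched as follows. Both function names are hypothetical. In the first function the dereference comes before the check, so the optimizer may assume the pointer is non-null and treat the check as dead. In the second the check comes first, so it must be kept.

```c
#include <stddef.h>

/* Dereference first: after "*p" the optimizer may assume p != NULL,
   so the later comparison may be folded away as unreachable. */
int read_then_check(int *p) {
    int v = *p;
    if (p == NULL)
        return -1;      /* may be removed by the optimizer */
    return v;
}

/* Check first: nothing has been assumed about p yet, so the check stays. */
int check_then_read(int *p) {
    if (p == NULL)
        return -1;
    return *p;
}
```

From the compiler's side the two cases look alike: it cannot tell whether the check in read_then_check is harmless dead code or the symptom of a bug.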
Clang's Approach to Handling Undefined Behavior
Clang's first step to improve the world's code is to turn on a whole lot more warnings by default than other compilers do. While some developers are disciplined and build with "-Wall -Wextra" (for example), many people don't know about these flags or don't bother to pass them. Turning more warnings on by default catches more bugs more of the time.
The second step is that Clang generates warnings for many classes of undefined behavior (including dereference of null, oversized shifts, etc.) that are obvious in the code to catch some common mistakes. Some of the caveats are mentioned above, but these seem to work well in practice.
The third step is that the LLVM optimizer generally takes much less liberty with undefined behavior than it could. Though the standard says that any instance of undefined behavior has completely unbounded effects on the program, this is not a particularly useful or developer-friendly behavior to take advantage of. Instead, the LLVM optimizer handles these optimizations in a few different ways (the links describe rules of LLVM IR, not C, sorry!):
int *foo(long x) {
    return new int[x];
}
__Z3fool:
    movl    $4, %ecx
    movq    %rdi, %rax
    mulq    %rcx
    movq    $-1, %rdi        # Set the size to -1 on overflow
    cmovnoq %rax, %rdi       # Which causes 'new' to throw std::bad_alloc
    jmp     __Znam
__Z3fool:
    salq    $2, %rdi
    jmp     __Znam           # Security bug on overflow!
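The overflow check in the first assembly listing can be written out in C roughly as below. This is only a sketch of the idea, not the code the compiler actually emits, and checked_array_bytes is a made-up name: compute count times element size, and if the multiplication would overflow, substitute the largest possible size so that the allocation is certain to fail instead of silently allocating too little.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns count * elem_size, or SIZE_MAX if that product would overflow.
   Passing SIZE_MAX on to an allocator makes the request fail cleanly,
   mirroring the "set the size to -1 on overflow" step in the listing. */
size_t checked_array_bytes(size_t count, size_t elem_size) {
    if (elem_size != 0 && count > SIZE_MAX / elem_size)
        return SIZE_MAX;
    return count * elem_size;
}
```

The unchecked version corresponds to the second listing: a plain shift by 2, which wraps around for large counts and yields a buffer far smaller than the caller expects.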
While (from a pedantic language lawyer standpoint) this is strictly true, we quickly learned that people do occasionally dereference null pointers, and having the code execution just fall into the top of the next function makes it very difficult to understand the problem. From the performance angle, the most important aspect of exposing these is to squash downstream code. Because of this, Clang turns these into a runtime trap: if one of these is actually dynamically reached, the program stops immediately and can be debugged. The drawback of doing this is that we slightly bloat code by having these operations and having the conditions that control their predicates.
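A hand-written analogue of that behavior, assuming a GCC- or Clang-compatible compiler that provides the __builtin_trap() builtin, might look like the sketch below. The function name load_or_trap is hypothetical, and this shows the effect at the source level rather than what the optimizer does internally.

```c
/* If the invalid path is ever reached at run time, stop immediately so the
   failure can be debugged at the point where it happened. */
int load_or_trap(const int *p) {
    if (!p)
        __builtin_trap();   /* GCC/Clang builtin: abnormal termination */
    return *p;
}
```

The cost mentioned above is visible here too: the extra branch and trap instruction remain in the generated code.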
The other, more significant, limiting factor is that the warning wouldn't have any of the "tracking" information to be able to explain that an operation is the result of unrolling a loop three times and inlining it through four levels of function calls. At best we'll be able to point out the file/line/column of the original operation, which will be useful in the most trivial cases, but is likely to be extremely confusing in other cases. In any event, this hasn't been a high priority for us to implement because a) it isn't likely to give a good experience, b) we won't be able to turn it on by default, and c) it is a lot of work to implement.
Using a Safer Dialect of C (and other options)
If writing code in a non-portable dialect of C isn't your thing, then the -ftrapv and -fcatch-undefined-behavior flags (along with the other tools mentioned before) can be useful weapons in your arsenal to track down these sorts of bugs. Enabling them in your debug builds can be a great way to find related bugs early. These flags can also be useful in production code if you are building security critical applications. While they provide no guarantee that they will find all bugs, they do find a useful subset of bugs.
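For a sense of the kind of condition -ftrapv checks at run time, here is a portable sketch that tests whether a signed addition would overflow without actually performing it. The function name add_would_overflow is invented for this example. With -ftrapv the compiler inserts an equivalent check itself and traps rather than returning a flag.

```c
#include <limits.h>
#include <stdbool.h>

/* True if a + b would overflow the range of int. The comparisons are
   arranged so that no intermediate result can itself overflow. */
bool add_would_overflow(int a, int b) {
    if (b > 0)
        return a > INT_MAX - b;
    if (b < 0)
        return a < INT_MIN - b;
    return false;
}
```

Code that cannot rely on compiler flags can call a check like this before the addition. Code built with -ftrapv gets the check without any source changes, at some cost in speed.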
Ultimately, the real problem here is that C just isn't a "safe" language and that (despite its success and popularity) many people do not really understand how the language works. In its decades of evolution prior to standardization in 1989, C migrated from being a "low level systems programming language that was a tiny layer above PDP assembly" to being a "low level systems programming language, trying to provide decent performance by breaking many people's expectations". On the one hand, these C "cheats" almost always work and code is generally more performant because of them (and in some cases, much more performant). On the other hand, the places where C cheats are often some of the most surprising to people and typically strike at the worst possible time.
C is much more than a portable assembler, sometimes in very surprising ways. I hope this discussion helps explain some of the issues behind undefined behavior in C, at least from a compiler implementer's viewpoint.