With Unladen Swallow going the way of the Norwegian Blue [1] [2], this PEP has been deemed to have been withdrawn.
Abstract

This PEP proposes the merger of the Unladen Swallow project [3] into CPython’s source tree. Unladen Swallow is an open-source branch of CPython focused on performance. Unladen Swallow is source-compatible with valid Python 2.6.4 applications and C extension modules.
Unladen Swallow adds a just-in-time (JIT) compiler to CPython, allowing for the compilation of selected Python code to optimized machine code. Beyond classical static compiler optimizations, Unladen Swallow’s JIT compiler takes advantage of data collected at runtime to make checked assumptions about code behaviour, allowing the production of faster machine code.
This PEP proposes to integrate Unladen Swallow into CPython’s development tree in a separate py3k-jit branch, targeted for eventual merger with the main py3k branch. While Unladen Swallow is by no means finished or perfect, we feel that Unladen Swallow has reached sufficient maturity to warrant incorporation into CPython’s roadmap. We have sought to create a stable platform that the wider CPython development team can build upon, a platform that will yield increasing performance for years to come.
This PEP will detail Unladen Swallow’s implementation and how it differs from CPython 2.6.4; the benchmarks used to measure performance; the tools used to ensure correctness and compatibility; the impact on CPython’s current platform support; and the impact on the CPython core development process. The PEP concludes with a proposed merger plan and brief notes on possible directions for future work.
We seek the following from the BDFL: approval to eventually merge Unladen Swallow into the py3k branch once all blocking issues [31] have been addressed.

Rationale, Implementation

Many companies and individuals would like Python to be faster, to enable its use in more projects. Google is one such company.
Unladen Swallow is a Google-sponsored branch of CPython, initiated to improve the performance of Google’s numerous Python libraries, tools and applications. To make the adoption of Unladen Swallow as easy as possible, the project initially aimed at four goals:
We chose 2.6.4 as our baseline because Google uses CPython 2.4 internally, and jumping directly from CPython 2.4 to CPython 3.x was considered infeasible.
To achieve the desired performance, Unladen Swallow has implemented a just-in-time (JIT) compiler [51] in the tradition of Urs Hoelzle’s work on Self [52], gathering feedback at runtime and using that to inform compile-time optimizations. This is similar to the approach taken by the current breed of JavaScript engines [59], [60]; most Java virtual machines [63]; Rubinius [61], MacRuby [62], and other Ruby implementations; Psyco [64]; and others.
We explicitly reject any suggestion that our ideas are original. We have sought to reuse the published work of other researchers wherever possible. If we have done any original work, it is by accident. We have tried, as much as possible, to take good ideas from all corners of the academic and industrial community. A partial list of the research papers that have informed Unladen Swallow is available on the Unladen Swallow wiki [54].
The key observation about optimizing dynamic languages is that they are only dynamic in theory; in practice, each individual function or snippet of code is relatively static, using a stable set of types and child functions. The current CPython bytecode interpreter assumes the worst about the code it is running: that at any moment the user might override the len() function or pass a never-before-seen type into a function. In practice this never happens, but user code pays for that support. Unladen Swallow takes advantage of the relatively static nature of user code to improve performance.
At a high level, the Unladen Swallow JIT compiler works by translating a function’s CPython bytecode to platform-specific machine code, using data collected at runtime, as well as classical compiler optimizations, to improve the quality of the generated machine code. Because we only want to spend resources compiling Python code that will actually benefit the runtime of the program, an online heuristic is used to assess how hot a given function is. Once the hotness value for a function crosses a given threshold, it is selected for compilation and optimization. Until a function is judged hot, however, it runs in the standard CPython eval loop, which in Unladen Swallow has been instrumented to record interesting data about each bytecode executed. This runtime data is used to reduce the flexibility of the generated machine code, allowing us to optimize for the common case. For example, we collect data on

- whether a branch is taken or not taken;
- the types seen by operators: if a + b is only ever adding integers, the generated machine code for that snippet will not support adding floats;
- the functions called at each callsite: if a foo() callsite is always calling the same foo function, we can optimize the call or inline it away.

Refer to [55] for a complete list of data points gathered and how they are used.
However, if by chance the historically-untaken branch is now taken, or some integer-optimized a + b snippet receives two strings, we must support this. We cannot change Python semantics. Each of these sections of optimized machine code is preceded by a guard, which checks whether the simplifying assumptions we made when optimizing still hold. If the assumptions are still valid, we run the optimized machine code; if they are not, we revert back to the interpreter and pick up where we left off.
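The guard-and-fallback pattern just described can be sketched in pure Python (a hypothetical illustration; the names are ours, and the real system emits native machine code via LLVM rather than Python):

```python
def interpret_add(a, b):
    # Generic path: full Python '+' semantics, as the regular eval loop
    # would execute them (including __add__/__radd__ dispatch).
    return a + b

def make_specialized_add():
    # Suppose runtime feedback showed this site only ever added ints.
    def specialized_add(a, b):
        # Guard: verify that the simplifying assumption still holds.
        if type(a) is int and type(b) is int:
            # Fast path: in real generated code this would be a native
            # integer add, not a call back into the interpreter.
            return a + b
        # Guard failed: bail out to the generic interpreter path.
        return interpret_add(a, b)
    return specialized_add

add = make_specialized_add()
```

Whether ints, floats, or strings arrive, the observable semantics are unchanged; only the int + int case takes the fast path.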
We have chosen to reuse a set of existing compiler libraries called LLVM [4] for code generation and code optimization. This has saved our small team from needing to understand and debug code generation on multiple machine instruction sets and from needing to implement a large set of classical compiler optimizations. The project would not have been possible without such code reuse. We have found LLVM easy to modify and its community receptive to our suggestions and modifications.
In somewhat more depth, Unladen Swallow’s JIT works by compiling CPython bytecode to LLVM’s own intermediate representation (IR) [94], taking into account any runtime data from the CPython eval loop. We then run a set of LLVM’s built-in optimization passes, producing a smaller, optimized version of the original LLVM IR. LLVM then lowers the IR to platform-specific machine code, performing register allocation, instruction scheduling, and any necessary relocations. This arrangement of the compilation pipeline allows the LLVM-based JIT to be easily omitted from a compiled python binary by passing --without-llvm to ./configure; various use cases for this flag are discussed later.
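To see the compiler’s input concretely, CPython’s standard dis module will print the bytecode for a function; this bytecode, plus the runtime feedback, is what a JIT of this design translates to LLVM IR (the exact opcodes printed vary by CPython version):

```python
import dis

def f(a, b):
    return a + b

# Print the CPython bytecode for f; a JIT in this design would walk
# these instructions and emit equivalent LLVM IR for each one.
dis.dis(f)
```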
For a complete detailing of how Unladen Swallow works, consult the Unladen Swallow documentation [53], [55].
Unladen Swallow has focused on improving the performance of single-threaded, pure-Python code. We have not made an effort to remove CPython’s global interpreter lock (GIL); we feel this is separate from our work, and due to its sensitivity, is best done in a mainline development branch. We considered making GIL-removal a part of Unladen Swallow, but were concerned by the possibility of introducing subtle bugs when porting our work from CPython 2.6 to 3.x.
A JIT compiler is an extremely versatile tool, and we have by no means exhausted its full potential. We have tried to create a sufficiently flexible framework that the wider CPython development community can build upon it for years to come, extracting increased performance in each subsequent release.
Alternatives

There are a number of alternative strategies for improving Python performance which we considered, but found unsatisfactory.
Static compilers like these are useful for writing extension modules without worrying about reference counting, but because they are static, ahead-of-time compilers, they cannot optimize the full range of code under consideration by a just-in-time compiler informed by runtime data.
Benchmarks

Unladen Swallow has developed a fairly large suite of benchmarks, ranging from synthetic microbenchmarks designed to test a single feature up through whole-application macrobenchmarks. The inspiration for these benchmarks has come variously from third-party contributors (in the case of the html5lib benchmark), Google’s own internal workloads (slowspitfire, pickle, unpickle), as well as tools and libraries in heavy use throughout the wider Python community (django, 2to3, spambayes). These benchmarks are run through a single interface called perf.py that takes care of collecting memory usage information, graphing performance, and running statistics on the benchmark results to ensure significance.
The full list of available benchmarks is available on the Unladen Swallow wiki [43], including instructions on downloading and running the benchmarks for yourself. All our benchmarks are open-source; none are Google-proprietary. We believe this collection of benchmarks serves as a useful tool to benchmark any complete Python implementation, and indeed, PyPy is already using these benchmarks for their own performance testing [80], [95]. We welcome this, and we seek additional workloads for the benchmark suite from the Python community.
We have focused our efforts on collecting macrobenchmarks and benchmarks that simulate real applications as well as possible, when running a whole application is not feasible. Along a different axis, our benchmark collection originally focused on the kinds of workloads seen by Google’s Python code (webapps, text processing), though we have since expanded the collection to include workloads Google cares nothing about. We have so far shied away from heavily numerical workloads, since NumPy [79] already does an excellent job on such code and so improving numerical performance was not an initial high priority for the team; we have begun to incorporate such benchmarks into the collection [96] and have started work on optimizing numerical Python code.
Beyond these benchmarks, there are also a variety of workloads we are explicitly not interested in benchmarking. Unladen Swallow is focused on improving the performance of pure-Python code, so the performance of extension modules like NumPy is uninteresting since NumPy’s core routines are implemented in C. Similarly, workloads that involve a lot of IO, like GUIs, databases or socket-heavy applications, would, we feel, fail to accurately measure interpreter or code generation optimizations. That said, there is certainly room to improve the performance of C-language extension modules in the standard library, and as such, we have added benchmarks for the cPickle and re modules.
The charts below compare the arithmetic mean of multiple benchmark iterations for CPython 2.6.4 and Unladen Swallow. perf.py gathers more data than this, and indeed, arithmetic mean is not the whole story; we reproduce only the mean for the sake of conciseness. We include the t score from the Student’s two-tailed T-test [44] at the 95% confidence interval to indicate the significance of the result. Most benchmarks are run for 100 iterations, though some longer-running whole-application benchmarks are run for fewer iterations.
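For readers reproducing the statistics by hand, the two-sample t score can be computed roughly as below (a simplified sketch using Welch’s formula; perf.py’s actual implementation may differ):

```python
import math

def two_sample_t(xs, ys):
    # t statistic for the difference of two sample means (Welch's form).
    nx, ny = len(xs), len(ys)
    mean_x = sum(xs) / nx
    mean_y = sum(ys) / ny
    # Unbiased sample variances.
    var_x = sum((v - mean_x) ** 2 for v in xs) / (nx - 1)
    var_y = sum((v - mean_y) ** 2 for v in ys) / (ny - 1)
    return (mean_x - mean_y) / math.sqrt(var_x / nx + var_y / ny)

# Example: baseline vs. candidate benchmark timings (made-up numbers).
baseline = [1.00, 1.10, 0.90, 1.00]
candidate = [2.00, 2.10, 1.90, 2.00]
t = two_sample_t(baseline, candidate)
```

A large negative t here indicates the candidate is significantly slower than the baseline, matching the sign convention in the reports below.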
A description of each of these benchmarks is available on the Unladen Swallow wiki [43].
Command:
./perf.py -r -b default,apps ../a/python ../b/python
32-bit; gcc 4.0.3; Ubuntu Dapper; Intel Core2 Duo 6600 @ 2.4GHz; 2 cores; 4MB L2 cache; 4GB RAM
64-bit; gcc 4.2.4; Ubuntu Hardy; AMD Opteron 8214 HE @ 2.2 GHz; 4 cores; 1MB L2 cache; 8GB RAM
Many of these benchmarks take a hit under Unladen Swallow because the current version blocks execution to compile Python functions down to machine code. This leads to the behaviour seen in the timeline graphs for the html5lib and rietveld benchmarks, for example, and slows down the overall performance of 2to3. We have an active development branch to fix this problem ([46], [47]), but working within the strictures of CPython’s current threading system has complicated the process and required far more care and time than originally anticipated. We view this issue as critical to final merger into the py3k branch.
We have obviously not met our initial goal of a 5x performance improvement. A performance retrospective follows, which addresses why we failed to meet our initial performance goal. We maintain a list of yet-to-be-implemented performance work [50].
Memory Usage

The following table shows maximum memory usage (in kilobytes) for each of Unladen Swallow’s default benchmarks for both CPython 2.6.4 and Unladen Swallow r988, as well as a timeline of memory usage across the lifetime of the benchmark. We include tables for both 32- and 64-bit binaries. Memory usage was measured on Linux 2.6 systems by summing the Private_ sections from the kernel’s /proc/$pid/smaps pseudo-files [45].
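The measurement just described amounts to summing the Private_* lines of the smaps pseudo-file; a minimal sketch (perf.py’s actual parser may differ):

```python
def private_kb(smaps_text):
    # Sum every "Private_Clean:"/"Private_Dirty:" value (in kB) from the
    # contents of a /proc/$pid/smaps pseudo-file.
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Private_"):
            # Lines look like: "Private_Dirty:        12 kB"
            total += int(line.split()[1])
    return total
```

Summing these fields across all mappings gives the process’s private memory footprint in kilobytes.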
Command:
./perf.py -r --track_memory -b default,apps ../a/python ../b/python
32-bit
64-bit
The increased memory usage comes from a) LLVM code generation, analysis and optimization libraries; b) native code; c) memory usage issues or leaks in LLVM; d) data structures needed to optimize and generate machine code; e) as-yet uncategorized other sources.
While we have made significant progress in reducing memory usage since the initial naive JIT implementation [42], there is obviously more to do. We believe that there are still memory savings to be made without sacrificing performance. We have tended to focus on raw performance, and we have not yet made a concerted push to reduce memory usage. We view reducing memory usage as a blocking issue for final merger into the py3k branch. We seek guidance from the community on an acceptable level of increased memory usage.
Start-up Time

Statically linking LLVM’s code generation, analysis and optimization libraries increases the time needed to start the Python binary. C++ static initializers used by LLVM also increase start-up time, as does importing the collection of pre-compiled C runtime routines we want to inline to Python code.
Results from Unladen Swallow’s startup benchmarks:
$ ./perf.py -r -b startup /tmp/cpy-26/bin/python /tmp/unladen/bin/python

### normal_startup ###
Min: 0.219186 -> 0.352075: 1.6063x slower
Avg: 0.227228 -> 0.364384: 1.6036x slower
Significant (t=-51.879098, a=0.95)
Stddev: 0.00762 -> 0.02532: 3.3227x larger
Timeline: http://tinyurl.com/yfe8z3r

### startup_nosite ###
Min: 0.105949 -> 0.264912: 2.5004x slower
Avg: 0.107574 -> 0.267505: 2.4867x slower
Significant (t=-703.557403, a=0.95)
Stddev: 0.00214 -> 0.00240: 1.1209x larger
Timeline: http://tinyurl.com/yajn8fa

### bzr_startup ###
Min: 0.067990 -> 0.097985: 1.4412x slower
Avg: 0.084322 -> 0.111348: 1.3205x slower
Significant (t=-37.432534, a=0.95)
Stddev: 0.00793 -> 0.00643: 1.2330x smaller
Timeline: http://tinyurl.com/ybdm537

### hg_startup ###
Min: 0.016997 -> 0.024997: 1.4707x slower
Avg: 0.026990 -> 0.036772: 1.3625x slower
Significant (t=-53.104502, a=0.95)
Stddev: 0.00406 -> 0.00417: 1.0273x larger
Timeline: http://tinyurl.com/ycout8m
bzr_startup and hg_startup measure how long it takes Bazaar and Mercurial, respectively, to display their help screens. startup_nosite runs python -S many times; usage of the -S option is rare, but we feel this gives a good indication of where increased startup time is coming from.
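A rough way to reproduce a startup measurement yourself (a simplified sketch of what such a benchmark does, not perf.py’s actual code):

```python
import subprocess
import sys
import time

def startup_time(extra_args, runs=5):
    # Best-of-N wall-clock time to launch the interpreter and exit
    # immediately; running no work isolates pure startup cost.
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, *extra_args, "-c", "pass"],
                       check=True)
        best = min(best, time.perf_counter() - start)
    return best

# Compare normal startup against `-S` (skip site initialization):
# startup_time([]) vs. startup_time(["-S"])
```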
Unladen Swallow has made headway toward optimizing startup time, but there is still more work to do and further optimizations to implement. Improving start-up time is a high-priority item [33] in Unladen Swallow’s merger punchlist.
Binary Size

Statically linking LLVM’s code generation, analysis and optimization libraries significantly increases the size of the python binary. The tables below report stripped on-disk binary sizes; the binaries are stripped to better correspond with the configurations used by system package managers. We feel this is the most realistic measure of any change in binary size.
The increased binary size is caused by statically linking LLVM’s code generation, analysis and optimization libraries into the python binary. This can be straightforwardly addressed by modifying LLVM to better support shared linking and then using that instead of the current static linking. For the moment, though, static linking provides an accurate look at the cost of linking against LLVM.
Even when statically linking, we believe there is still headroom to improve on-disk binary size by narrowing Unladen Swallow’s dependencies on LLVM. This issue is actively being addressed [32].
Performance Retrospective

Our initial goal for Unladen Swallow was a 5x performance improvement over CPython 2.6. We did not hit that goal, nor, to put it bluntly, did we even come close. Why did the project not hit that goal, and can an LLVM-based JIT ever hit that goal?
Why did Unladen Swallow not achieve its 5x goal? The primary reason was that LLVM required more work than we had initially anticipated. Based on the fact that Apple was shipping products based on LLVM [81], and other high-level languages had successfully implemented LLVM-based JITs ([61], [62], [82]), we had assumed that LLVM’s JIT was relatively free of show-stopper bugs.
That turned out to be incorrect. We had to turn our attention away from performance to fix a number of critical bugs in LLVM’s JIT infrastructure (for example, [83], [84]) as well as a number of nice-to-have enhancements that would enable further optimizations along various axes (for example, [86], [85], [87]). LLVM’s static code generation facilities, tools and optimization passes are stable and stress-tested, but the just-in-time infrastructure was relatively untested and buggy. We have fixed this.
(Our hypothesis is that we hit these problems – problems other projects had avoided – because of the complexity and thoroughness of CPython’s standard library test suite.)
We also diverted engineering effort away from performance and into support tools such as gdb and oProfile. gdb did not work well with JIT compilers at all, and LLVM previously had no integration with oProfile. Having JIT-aware debuggers and profilers has been very valuable to the project, and we do not regret channeling our time in these directions. See the Debugging and Profiling sections for more information.
Can an LLVM-based CPython JIT ever hit the 5x performance target? The benchmark results for JIT-based JavaScript implementations suggest that 5x is indeed possible, as do the results PyPy’s JIT has delivered for numeric workloads. The experience of Self-92 [52] is also instructive.
Can LLVM deliver this? We believe that we have only begun to scratch the surface of what our LLVM-based JIT can deliver. The optimizations we have incorporated into this system thus far have borne significant fruit (for example, [88], [89], [90]). Our experience to date is that the limiting factor on Unladen Swallow’s performance is the engineering cycles needed to implement the literature. We have found LLVM easy to work with and to modify, and its built-in optimizations have greatly simplified the task of implementing Python-level optimizations.
An overview of further performance opportunities is discussed in the Future Work section.
Correctness and Compatibility

Unladen Swallow’s correctness test suite includes CPython’s test suite (under Lib/test/), as well as a number of important third-party applications and libraries [6]. A full list of these applications and libraries is reproduced below. Any dependencies needed by these packages, such as zope.interface [34], are also tested indirectly as a part of testing the primary package, thus widening the corpus of tested third-party Python code.
These applications pass all relevant tests when run under Unladen Swallow. Note that some tests that failed against our baseline of CPython 2.6.4 were disabled, as were tests that made assumptions about CPython internals such as exact bytecode numbers or bytecode format. Any package with disabled tests includes a README.unladen file that details the changes (for example, [37]).
In addition, Unladen Swallow is tested automatically against an array of internal Google Python libraries and applications. These include Google’s internal Python bindings for BigTable [35], the Mondrian code review application [36], and Google’s Python standard library, among others. The changes needed to run these projects under Unladen Swallow have consistently broken down into one of three camps, such as int to Py_ssize_t and similar API changes.

Testing against this wide range of public and proprietary applications and libraries has been instrumental in ensuring the correctness of Unladen Swallow. Testing has exposed bugs that we have duly corrected. Our automated regression testing regime has given us high confidence in our changes as we have moved forward.
In addition to third-party testing, we have added further tests to CPython’s test suite for corner cases of the language or implementation that we felt were untested or underspecified (for example, [48], [49]). These have been especially important when implementing optimizations, helping make sure we have not accidentally broken the darker corners of Python.
We have also constructed a test suite focused solely on the LLVM-based JIT compiler and the optimizations implemented for it [38]. Because of the complexity and subtlety inherent in writing an optimizing compiler, we have attempted to exhaustively enumerate the constructs, scenarios and corner cases we are compiling and optimizing. The JIT tests also include tests for things like the JIT hotness model, making it easier for future CPython developers to maintain and improve.
We have recently begun using fuzz testing [39] to stress-test the compiler. We have used both pyfuzz [40] and Fusil [41] in the past, and we recommend they be introduced as an automated part of the CPython testing process.
Known Incompatibilities

The only application or library we know of that works with CPython 2.6.4 but not with Unladen Swallow is Psyco [64]. We are aware of some libraries, such as PyGame [78], that work well with CPython 2.6.4 but suffer some degradation due to changes made in Unladen Swallow. We are tracking this issue [47] and are working to resolve these instances of degradation.
While Unladen Swallow is source-compatible with CPython 2.6.4, it is not binary compatible. C extension modules compiled against one will need to be recompiled to work with the other.
The merger of Unladen Swallow should have minimal impact on long-lived CPython optimization branches like WPython. WPython [104] and Unladen Swallow are largely orthogonal, and there is no technical reason why both could not be merged into CPython. The changes needed to make WPython compatible with a JIT-enhanced version of CPython should be minimal [113]. The same should be true for other CPython optimization projects (for example, [114]).
Invasive forks of CPython such as Stackless Python [115] are more challenging to support. Since Stackless is highly unlikely to be merged into CPython [116] and an increased maintenance burden is part and parcel of any fork, we consider compatibility with Stackless to be relatively low-priority. JIT-compiled stack frames use the C stack, so Stackless should be able to treat them the same as it treats calls through extension modules. If that turns out to be unacceptable, Stackless could either remove the JIT compiler or improve JIT code generation to better support heap-based stack frames [117], [118].
Platform Support

Unladen Swallow is inherently limited by the platform support provided by LLVM, especially LLVM’s JIT compilation system [7]. LLVM’s JIT has the best support on x86 and x86-64 systems, and these are the platforms where Unladen Swallow has received the most testing. We are confident in LLVM/Unladen Swallow’s support for x86 and x86-64 hardware. PPC and ARM support exists, but is not widely used and may be buggy (for example, [99], [83], [100]).
Unladen Swallow is known to work on the following operating systems: Linux, Darwin, Windows. Unladen Swallow has received the most testing on Linux and Darwin, though it still builds and passes its tests on Windows.
In order to support hardware and software platforms where LLVM’s JIT does not work, Unladen Swallow provides a ./configure --without-llvm option. This flag carves out any part of Unladen Swallow that depends on LLVM, yielding a Python binary that works and passes its tests, but has no performance advantages. This configuration is recommended for hardware unsupported by LLVM, or for systems that care more about memory usage than performance.
Unladen Swallow’s JIT compiler operates on CPython bytecode, and as such, it is immune to Python language changes that affect only the parser.
We recommend that changes to the CPython bytecode compiler or the semantics of individual bytecodes be prototyped in the interpreter loop first, then ported to the JIT compiler once the semantics are clear. To make this easier, Unladen Swallow includes a --without-llvm configure-time option that strips out the JIT compiler and all associated infrastructure. This leaves the current burden of experimentation unchanged, so that developers can prototype in the current low-barrier-to-entry interpreter loop.
Unladen Swallow began implementing its JIT compiler by doing straightforward, naive translations from bytecode implementations into LLVM API calls. We found this process to be easily understood, and we recommend the same approach for CPython. We include several sample changes from the Unladen Swallow repository here as examples of this style of development: [26], [27], [28], [29].
Debugging

The Unladen Swallow team implemented changes to gdb to make it easier to use gdb to debug JIT-compiled Python code. These changes were released in gdb 7.0 [17]. They make it possible for gdb to identify and unwind past JIT-generated call stack frames. This allows gdb to continue to function as before for CPython development if one is changing, for example, the list type or builtin functions.
Example backtrace after our changes, where baz, bar and foo are JIT-compiled:
Program received signal SIGSEGV, Segmentation fault.
0x00002aaaabe7d1a8 in baz ()
(gdb) bt
#0 0x00002aaaabe7d1a8 in baz ()
#1 0x00002aaaabe7d12c in bar ()
#2 0x00002aaaabe7d0aa in foo ()
#3 0x00002aaaabe7d02c in main ()
#4 0x0000000000b870a2 in llvm::JIT::runFunction (this=0x1405b70, F=0x14024e0, ArgValues=...)
   at /home/rnk/llvm-gdb/lib/ExecutionEngine/JIT/JIT.cpp:395
#5 0x0000000000baa4c5 in llvm::ExecutionEngine::runFunctionAsMain (this=0x1405b70, Fn=0x14024e0, argv=..., envp=0x7fffffffe3c0)
   at /home/rnk/llvm-gdb/lib/ExecutionEngine/ExecutionEngine.cpp:377
#6 0x00000000007ebd52 in main (argc=2, argv=0x7fffffffe3a8, envp=0x7fffffffe3c0)
   at /home/rnk/llvm-gdb/tools/lli/lli.cpp:208
Previously, the JIT-compiled frames would have caused gdb to unwind incorrectly, generating lots of obviously-incorrect #6 0x00002aaaabe7d0aa in ?? ()-style stack frames.
Highlights:
Lowlights:
The Unladen Swallow team is working with Apple to get these changes incorporated into their future gdb releases.
Profiling

Unladen Swallow integrates with oProfile 0.9.4 and newer [18] to support assembly-level profiling on Linux systems. This means that oProfile will correctly symbolize JIT-compiled functions in its reports.
Example report, where the #u#-prefixed symbol names are JIT-compiled Python functions:
$ opreport -l ./python | less
CPU: Core 2, speed 1600 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name   symbol name
79589     4.2329  python       PyString_FromFormatV
62971     3.3491  python       PyEval_EvalCodeEx
62713     3.3354  python       tupledealloc
57071     3.0353  python       _PyEval_CallFunction
50009     2.6597  24532.jo     #u#force_unicode
47468     2.5246  python       PyUnicodeUCS2_Decode
45829     2.4374  python       PyFrame_New
45173     2.4025  python       lookdict_string
43082     2.2913  python       PyType_IsSubtype
39763     2.1148  24532.jo     #u#render5
38145     2.0287  python       _PyType_Lookup
37643     2.0020  python       PyObject_GC_UnTrack
37105     1.9734  python       frame_dealloc
36849     1.9598  python       PyEval_EvalFrame
35630     1.8950  24532.jo     #u#resolve
33313     1.7717  python       PyObject_IsInstance
33208     1.7662  python       PyDict_GetItem
33168     1.7640  python       PyTuple_New
30458     1.6199  python       PyCFunction_NewEx
This support is functional, but as-yet unpolished. Unladen Swallow maintains a punchlist of items we feel are important to improve in our oProfile integration to make it more useful to core CPython developers [19].
Highlights:
Lowlights:
We recommend using oProfile 0.9.5 (and newer) to work around a now-fixed bug on x86-64 platforms in oProfile. oProfile 0.9.4 will work fine on 32-bit platforms, however.
Given the ease of integrating oProfile with LLVM [21] and Unladen Swallow [22], other profiling tools should be easy as well, provided they support a similar JIT interface [23].
We have documented the process for using oProfile to profile Unladen Swallow [24]. This document will be merged into CPython’s Doc/ tree as part of the merger.
Addition of C++ to CPython

In order to use LLVM, Unladen Swallow has introduced C++ into the core CPython tree and build process. This is an unavoidable part of depending on LLVM; though LLVM offers a C API [8], it is limited and does not expose the functionality needed by CPython. Because of this, we have implemented the internal details of the Unladen Swallow JIT and its supporting infrastructure in C++. We do not propose converting the entire CPython codebase to C++.
Highlights:

- Most of the C++ can be disabled by building with ./configure --without-llvm, which even omits the dependency on libstdc++.

Lowlights:
LLVM is released regularly every six months. This means that LLVM may be released two or three times during the course of development of a CPython 3.x release. Each LLVM release brings newer and more powerful optimizations, improved platform support and more sophisticated code generation.
LLVM releases usually include incompatible changes to the LLVM C++ API; the release notes for LLVM 2.6 [9] include a list of intentionally-introduced incompatibilities. Unladen Swallow has tracked LLVM trunk closely over the course of development. Our experience has been that LLVM API changes are obvious and easily or mechanically remedied. We include two such changes from the Unladen Swallow tree as references here: [10], [11].
Due to API incompatibilities, we recommend that an LLVM-based CPython target compatibility with a single version of LLVM at a time. This will lower the overhead on the core development team. Pegging to an LLVM version should not be a problem from a packaging perspective, because pre-built LLVM packages generally become available via standard system package managers fairly quickly following an LLVM release, and failing that, llvm.org itself includes binary releases.
Unladen Swallow has historically included a copy of the LLVM and Clang source trees in the Unladen Swallow tree; this was done to allow us to closely track LLVM trunk as we made patches to it. We do not recommend this model of development for CPython. CPython releases should be based on official LLVM releases. Pre-built LLVM packages are available from MacPorts [12] for Darwin, and from most major Linux distributions ([13], [14], [16]). LLVM itself provides additional binaries, such as for MinGW [25].
LLVM is currently intended to be statically linked; this means that binary releases of CPython will include the relevant parts (not all!) of LLVM. This will increase the binary size, as noted above. To simplify downstream package management, we will modify LLVM to better support shared linking. This issue will block final merger [97].
Unladen Swallow has tasked a full-time engineer with fixing any remaining critical issues in LLVM before LLVM’s 2.7 release. We consider it essential that CPython 3.x be able to depend on a released version of LLVM, rather than closely tracking LLVM trunk as Unladen Swallow has done. We believe we will finish this work [98] before the release of LLVM 2.7, expected in May 2010.
Building CPython

In addition to a runtime dependency on LLVM, Unladen Swallow includes a build-time dependency on Clang [5], an LLVM-based C/C++ compiler. We use this to compile parts of the C-language Python runtime to LLVM’s intermediate representation; this allows us to perform cross-language inlining, yielding increased performance. Clang is not required to run Unladen Swallow. Clang binary packages are available from most major Linux distributions (for example, [15]).
We examined the impact of Unladen Swallow on the time needed to build Python, including configure, full builds and incremental builds after touching a single C source file.
./configure

          CPython 2.6.4   CPython 3.1.1   Unladen Swallow r988
  Run 1   0m20.795s       0m16.558s       0m15.477s
  Run 2   0m15.255s       0m16.349s       0m15.391s
  Run 3   0m15.228s       0m16.299s       0m15.528s

Full make

          CPython 2.6.4   CPython 3.1.1   Unladen Swallow r988
  Run 1   1m30.776s       1m22.367s       1m54.053s
  Run 2   1m21.374s       1m22.064s       1m49.448s
  Run 3   1m22.047s       1m23.645s       1m49.305s

Full builds take a hit due to a) additional .cc files needed for LLVM interaction, b) statically linking LLVM into libpython, c) compiling parts of the Python runtime to LLVM IR to enable cross-language inlining.
Incremental builds are also somewhat slower than mainline CPython. The table below shows incremental rebuild times after touching Objects/listobject.c.

As with full builds, this extra time comes from statically linking LLVM into libpython. If libpython were linked shared against LLVM, this overhead would go down.
We propose focusing our efforts on eventual merger with CPython’s 3.x line of development. The BDFL has indicated that 2.7 is to be the final release of CPython’s 2.x line of development [30], and since 2.7 alpha 1 has already been released, we have missed the window. Python 3 is the future, and that is where we will target our performance efforts.
We recommend the following plan for merger of Unladen Swallow into the CPython source tree:

- Creation of a branch in the CPython SVN repository to work in, call it py3k-jit as a strawman. This will be a branch of the CPython py3k branch.
- We will keep py3k-jit closely integrated with py3k. The further we deviate, the harder our work will be.
- Any JIT-related patches will go into the py3k-jit branch.
- Non-JIT-related patches will go into the py3k branch (once reviewed and approved) and be merged back into the py3k-jit branch.

Because Google uses CPython 2.x internally, Unladen Swallow is based on CPython 2.6. We would need to port our compiler to Python 3; this would be done as patches are applied to the py3k-jit branch, so that the branch remains a consistent implementation of Python 3 at all times.
We believe this approach will be minimally disruptive to the 3.2 or 3.3 release process while we iron out any remaining issues blocking final merger into py3k. Unladen Swallow maintains a punchlist of issues that must be addressed before final merger [31], which includes all problems mentioned in this PEP; we trust the CPython community will have its own concerns. This punchlist is not static; other issues may emerge in the future that will block final merger into the py3k branch.
Changes will be committed directly to the py3k-jit branch, with only large, tricky or controversial changes sent for pre-commit code review.
There is a chance that we will not be able to reduce memory usage or startup time to a level satisfactory to the CPython community. Our primary contingency plan for this situation is to shift from an online just-in-time compilation strategy to an offline ahead-of-time strategy using an instrumented CPython interpreter loop to obtain feedback. This is the same model used by gcc’s feedback-directed optimizations (-fprofile-generate) [111] and Microsoft Visual Studio’s profile-guided optimizations [112]; we will refer to this as “feedback-directed optimization” here, or FDO.
We believe that an FDO compiler for Python would be inferior to a JIT compiler. FDO requires a high-quality, representative benchmark suite, which is a relative rarity in both open- and closed-source development. A JIT compiler can dynamically find and optimize the hot spots in any application – benchmark suite or no – allowing it to adapt to changes in application bottlenecks without human intervention.
If an ahead-of-time FDO compiler is required, it should be able to leverage a large percentage of the code and infrastructure already developed for Unladen Swallow’s JIT compiler. Indeed, these two compilation strategies could exist side by side.
Future Work

A JIT compiler is an extremely flexible tool, and we have by no means exhausted its full potential. Unladen Swallow maintains a list of performance optimizations [50] that the team has not yet had time to implement.
This list is by no means exhaustive. There is a vast literature on optimizations for dynamic languages that could and should be implemented in terms of Unladen Swallow’s LLVM-based JIT compiler [54].
Licensing

All work on Unladen Swallow is licensed to the Python Software Foundation (PSF) under the terms of the Python Software Foundation License v2 [56] under the umbrella of Google's blanket Contributor License Agreement with the PSF.
LLVM is licensed [57] under the University of Illinois/NCSA Open Source License [58], a liberal, OSI-approved license. The University of Illinois Urbana-Champaign is the sole copyright holder for LLVM.
References

Copyright

This document has been placed in the public domain.