Unlike conventional compilation that optimizes and generates object code independently for each individual program module, omniscient compilation optimizes based on a view of all modules, across the entire program.
The PIC32 is a major departure from Microchip’s bread-and-butter line of 8- and 16-bit microcontrollers, so developing code for the PIC32 poses a new set of challenges. The highest nonvolatile-memory (NVM) density on 8-/16-bit PIC MCUs is 128 kbytes, SRAM is limited to 4 kbytes or less, and the register count tops out at 16. Midrange PICs have as few as one or two working registers and even smaller memory densities.
All 8-/16-bit PIC MCUs have banked memories that require assembly language or nonstandard C extensions to address efficiently, resulting in nonportable code. Thus, in spite of the common peripheral set and development environment shared by the PIC32 and its 8-/16-bit precursors, migrating nonstandard legacy code could pose serious complications.
In contrast to previous PICs, PIC32 devices have as much as 512 kbytes of flash, up to 32 kbytes of SRAM memory, and 32 general purpose registers. Also unlike 8-bit PICs, the PIC32 memory space is linear, so no complicated addressing schemes are required.
With fast multiply and divide instructions, a 256-byte instruction cache, a 5-stage pipeline, direct memory access, and fast context switching, the PIC32 offers instruction throughput of 1.56 DMIPS (Dhrystone million instructions per second) per MHz, the highest of any microcontroller in its class. Its 80-MHz maximum clock provides PIC32 users with unparalleled throughput and flexibility. So the question is: how does an engineer develop code to take full advantage of the PIC32’s horsepower?
One way to squeeze the maximum performance from a PIC32 is to choose a compilation method that exploits the benefits of the architecture while offering a seamless up/down migration path for legacy code. Basically, there are two approaches to compilation for the PIC32 architecture: conventional compilation that optimizes and generates object code independently for each individual program module and "omniscient" compilation that optimizes code based on a view of all program modules, across the entire program.
Pitfalls of conventional compilation
Conventional compilation technology (shown in Figure 1) shadows the modular embedded software design process, in which programs are broken up into modules, partly to accommodate their increasing complexity and partly to distribute programming tasks among teams of engineers to speed up the development process.
Compilers generate code in the same way, individually compiling each module into an independent sequence of low-level machine instructions, without any knowledge about what is in the other modules. Once all the modules are compiled, a linker links the modules together, along with any code being used from precompiled libraries.
The drawback of this approach is that the compiler never has complete information about the program being compiled. The "global" optimization claimed by many vendors is done only within single modules. There is no optimization across all program modules, which leads to the suboptimal allocation of the stack, registers, and memories.
Conventional compilers have restrictive, fixed calling conventions, specified by core and MCU vendors, to compensate for a traditional compiler’s "ignorance" of the whole program. Overwriting the data in a single register can have catastrophic consequences. Consider the spectacular loss of the European Space Agency’s Ariane 5 rocket, which veered off course and was destroyed on its maiden flight after an unhandled arithmetic overflow corrupted its guidance data.
In conventional compilation technology, calling conventions restrict the use of some registers specifically to avoid this type of error. Another thing conventional compilers do to prevent overwriting registers is to save and restore very large contexts during interrupts. Since the compiler has no way of knowing when a register is available or what data from another module might be written to it, the only way to ensure registers are not overwritten is to reserve specific registers for certain kinds of data (such as function parameters) and to always save every register possibly in use to the stack when calling a function or saving a context during an interrupt.
These calling strategies attempt to balance which registers are allocated for function parameters and which registers can be used by the called function for internal calculations, based on "average" code. If too many registers are used for parameters, there may not be enough available for the function’s code. If too few are used for function parameters, the stack may be overutilized, wasting both cycles and SRAM resources.
In the case of interrupts, again, a conventional compiler has no way of knowing which registers are used by the interrupt code. Thus, in order to prevent memory overwrites, most compilers save every register that might be used by an interrupt. This is the safest approach. However, since the number of cycles used is a direct function of the number of registers that are saved and restored, interrupt latency can be longer than necessary and performance can suffer. Cycles spent saving and restoring more registers than required for the context are basically wasted.
Legacy code is another consideration when writing code for the PIC32. The C language assumes a linear address space, and traditional compilers follow suit; they have no way of knowing which objects are stored in which memory locations.
When the MCU’s address space is linear, as is the case with the PIC32, there is no problem. When there are separate memory banks or regions, as in other PIC architectures, the software engineer must specify the addresses where objects are stored with assembly language or C extensions, making much legacy PIC code nonportable to the newer devices.
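As a sketch of this porting hazard, the fragment below uses an illustrative bank qualifier of the kind 8-bit PIC compilers provide. The `IN_BANK1` macro and the `__PICC__` guard are hypothetical stand-ins, not a specific vendor’s API; on a linear-memory part such as the PIC32 the qualifier simply compiles away:

```c
/* Illustrative sketch of the nonstandard extensions that banked PIC
 * architectures force on C code.  The bank qualifier below is
 * representative of 8-bit PIC compilers; on a linear-address part
 * such as the PIC32 it is unnecessary, so it expands to nothing. */
#ifdef __PICC__                /* hypothetical 8-bit compiler: banked SRAM */
  #define IN_BANK1 bank1       /* nonstandard placement qualifier */
#else                          /* PIC32 / standard C: linear memory */
  #define IN_BANK1
#endif

IN_BANK1 unsigned char rx_buffer[32];   /* placement is a porting hazard */

/* Portable code that merely uses the object is unaffected either way. */
unsigned char rx_sum(void)
{
    unsigned char i, sum = 0;
    for (i = 0; i < sizeof rx_buffer; i++)
        sum += rx_buffer[i];
    return sum;
}
```

Wrapping nonstandard qualifiers in macros like this is one common mitigation, but code that manipulates bank-select bits directly cannot be hidden this way and must be rewritten when migrating.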
Omniscient code generation
Newer compilers are now available with omniscient code generation (OCG) technology. A compiler with OCG can eliminate arbitrary restrictions on register usage and save only those context registers that are used for each particular interrupt. OCG works by collecting comprehensive data on register, stack, pointer, object, and variable declarations from all program modules before compiling the code (shown in Figure 2).
The OCG compiler analyzes all the program modules in one step and extracts a call graph structure (shown in Figure 3). Based on the call graph, the OCG compiler creates a pointer reference graph that tracks each instance of a variable having its address taken, plus each instance of an assignment of one pointer to another (directly, via a function return or parameter, or indirectly via another pointer).
The compiler then identifies all objects that can possibly be referenced by each pointer. This information is used to determine the exact size and scope of each pointer variable (shown in Figure 4). On PIC32 devices all pointers are 32 bits wide, as there is no advantage in allocating smaller pointers. However, when the OCG compiler detects that a pointer has only one possible target, it sidesteps the pointer completely, turning the indirection into a direct access.
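A minimal sketch of the single-target case, assuming a pointer whose only possible referent can be proven at compile time (all names here are illustrative, not from the article):

```c
/* Sketch of the pointer-scoping optimization described above.  An OCG
 * compiler builds a pointer reference graph; when it can prove that a
 * pointer has exactly one possible target, the indirection can be
 * replaced with a direct access to that object. */
static int speed_setpoint;      /* the only object whose address is taken */

static void bump(int *p)        /* 'p' can only ever point at speed_setpoint */
{
    *p += 1;                    /* may be compiled as the direct access
                                   speed_setpoint += 1;  no pointer load */
}

int bump_setpoint(void)
{
    bump(&speed_setpoint);      /* the single address-taken site */
    return speed_setpoint;
}
```

A conventional compiler seeing only this module could perform the same rewrite after inlining, but an OCG compiler can do it even when `bump` and its caller live in different modules.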
Since an OCG compiler knows exactly which registers are available at any point in the program and also which registers will be needed for every interrupt function in the program, it can generate code that maximizes register coverage, and minimizes stack use, code size, and the number of cycles required to save and restore those registers.
Dynamic Context Generation
The compiler’s contribution to interrupt latency is the way it generates the code that switches contexts. The amount of context switching code is directly related to the number of registers it saves in response to an interrupt. Conventional compilers save every register that might be used by an interrupt because they have no way of knowing which registers will or will not be used by it.
For example, for a high-priority interrupt that doesn’t take advantage of the PIC32’s shadow registers, a conventional compiler always saves all 32 registers, which may require a total of 65 instruction cycles, compared with the OCG requirement of 22. This may not seem like much of a difference, but in an interrupt-intensive application the CPU could spend thousands of extra cycles unnecessarily saving empty registers. Interrupts that do use the shadow registers typically require about 20 instruction cycles for the context save/restore.
In contrast, because an OCG compiler knows exactly which functions call, and are called by, other functions, which variables and registers are required, and which pointers point to which memory regions, it also knows exactly which registers will be used by every interrupt in the program. It can generate code accordingly, minimizing both the code size and the cycles required to save and restore the context. An OCG compiler may save as few as three registers for a high-priority interrupt that doesn’t use the shadow registers, reducing the total number of required instruction cycles from 65 to just 22 and cutting interrupt latency by about two-thirds. Even for interrupts that do use the shadow registers, the OCG compiler reduces interrupt latency by about 10%.
Depending on the application, the cycle savings can be substantial (shown in Figure 5). When compiled by a conventional, non-OCG compiler, a simple benchmark program servicing 65,535 interrupts requires 8,650,624 cycles to execute on a PIC32 running at 80 MHz with two wait states.
The same program, compiled by an OCG compiler, takes only 6,356,898 cycles, or 26.5% less (as Table 1 shows). In this interrupt-intensive program, the OCG compiler boosts the CPU’s effective throughput by roughly a quarter; a more interrupt-intensive program could see an even larger improvement.
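A back-of-envelope check of these figures: the totals are taken from the benchmark above, while the per-interrupt costs below are derived from them, not measured.

```c
/* Sanity-check the benchmark arithmetic quoted above.  The totals are
 * the article's figures; per-interrupt costs are derived from them. */
#define INTERRUPTS   65535UL     /* interrupts serviced by the benchmark */
#define CONV_CYCLES  8650624UL   /* total cycles, conventional compiler  */
#define OCG_CYCLES   6356898UL   /* total cycles, OCG compiler           */

/* Average cycles spent per interrupt (integer division). */
unsigned long cycles_per_interrupt(unsigned long total_cycles)
{
    return total_cycles / INTERRUPTS;
}

/* Whole-number percentage of cycles saved: 100 * (conv - ocg) / conv. */
unsigned long saving_percent(void)
{
    return (100UL * (CONV_CYCLES - OCG_CYCLES)) / CONV_CYCLES;
}
```

The derived numbers work out to roughly 132 cycles per interrupt for the conventional build versus roughly 97 for the OCG build, consistent with the quoted 26.5% saving.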
Static versus dynamic calling conventions
The large number of registers on the PIC32 provides a substantial opportunity for boosting CPU performance, because function parameters and other data usually stored in SRAM can instead be kept in registers, which require fewer cycles to access. Efficiently exploiting the PIC32’s 32 registers reduces the number of load/store cycles, freeing those cycles for computation and potentially improving processor throughput.
How much the compiler "knows" about the register usage of a called function plays a big role in how efficiently it exploits the PIC32’s registers. Because of the modular nature of embedded software and compilation, conventional compilers do not have enough knowledge of the whole program to truly optimize register coverage. They must rely on rigid, static calling conventions that dictate how arguments are passed and values returned. Each calling convention contains a set of rules defining which CPU registers are to be preserved across calls, and all functions in the program must adhere to the same convention.
The calling convention in GCC-based compilers specifies a fixed set of four PIC32 registers for passing parameters to functions. If a function requires fewer than four parameters, the compiler still considers all four registers to be used by parameters. They cannot be used for anything else. If the function requires more than four parameters, the extra parameters are passed on the stack in SRAM, even if other nonreserved registers are available.
For example, if a function, function_1, calls another function, function_2, a conventional compiler will spill function_1’s parameters to the stack to make room for function_2’s parameters, which are loaded into the four registers specified by the calling convention. Other registers may be free, but they will not be used unless the calling convention designates them for this purpose. When the second function completes, function_1’s parameters may have to be retrieved from RAM and loaded back into the registers. This cycle- and code-wasting data shifting occurs simply because the calling convention says it must.
In a sample program with function calls nested three deep and each function taking six parameters, a non-OCG PIC32 compiler uses four registers for passing parameters between functions, as required by the calling convention, and frequently moves parameters between those four registers and the stack. This data shifting consumes 144 bytes of stack space in SRAM and generates 476 instructions, costing 118 CPU cycles every time the call sequence executes.
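The shape of that benchmark might look like the hypothetical sketch below: calls nested three deep, six parameters each. Under a fixed four-register convention, the fifth and sixth parameters travel on the stack, and each nested call forces the caller’s own register-held parameters out to the stack as well. The function names and bodies are illustrative, not the article’s actual benchmark.

```c
/* Hypothetical sketch of the benchmark described above: calls nested
 * three deep, six parameters each.  With a fixed four-register
 * convention, parameters e and f go to the stack at every level, and
 * the caller's a..d must be spilled around each nested call.  An OCG
 * compiler with enough free registers keeps every parameter in a
 * register for the whole nest. */
static long level3(long a, long b, long c, long d, long e, long f)
{
    return a + b + c + d + e + f;
}

static long level2(long a, long b, long c, long d, long e, long f)
{
    /* a..f must survive the call below; a conventional compiler
       spills them to the stack around it. */
    return level3(f, e, d, c, b, a) + a;
}

long level1(long a, long b, long c, long d, long e, long f)
{
    return level2(a + 1, b, c, d, e, f);
}
```

Reordering the arguments at each level, as `level2` does, is what defeats a naive register assignment and forces the shuffling the article describes.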
Rather than relying on static calling conventions to allocate data between an unknown number of available registers, a compiler based on an omniscient code generator defers generating object code until a view of the whole program is available. Based on this "global" view of the complete program, the code generator employs optimization techniques that optimize register coverage. Since the compiler knows which registers are available at every point in the program, it can allocate the registers dynamically, based on the resources that are actually available at the time (as shown in Figure 6).
The PIC32 has such a large register set that it is usually possible to avoid using the stack for function parameters entirely. In the code example cited above, the OCG compiler determined that 18 registers were available to the nested functions at that point in the program. Accordingly, it allocated all the parameters to registers and used the stack only for the return address, cutting stack usage to just 16 bytes of SRAM. Because the parameters remained in registers as long as they were needed, 30% fewer instructions were required (336), and execution time fell from 118 cycles to only 80, 32% less (see Table 2).
According to Computer Architecture: A Quantitative Approach (John Hennessy and David Patterson, 1990; now published by Morgan Kaufmann), 30% of a RISC CPU’s cycles are spent moving data. If the smaller contexts and more flexible register coverage provided by an OCG compiler can cut that amount by one third (to 20%), the share of cycles available for actual processing rises from 70% to 80%, a throughput gain of about 15%. At 80 MHz, that is equivalent to raising the PIC32’s substantial 125-DMIPS capability to roughly 143 DMIPS. In addition, since fewer instructions are generated to move data, the code size is smaller and the amount of SRAM required for the stack is also smaller, potentially allowing the use of a less expensive microcontroller with smaller flash and SRAM memories.
Maintaining code portability
Many engineers will be migrating code to the PIC32 from PIC24 or dsPIC devices. Over time, designers may develop really robust code on the PIC32 and migrate it down to less expensive MCUs to reduce costs in derivative end-products with different feature sets and price points.
Microchip has taken great care to ensure code portability between its 8-/16- and 32-bit PICs, with a common peripheral set and tool integration into its MPLAB IDE. The PIC32 instruction set architecture is extremely C-friendly and greatly reduces designers’ dependency on hand-crafted, nonportable assembly code. In addition, the PIC32’s linear address space and very generous memory resources make it unlikely that software engineers will be forced to resort to assembler to fit the design into scarce NVM or RAM–a common situation with 8-/16-bit MCUs.
However, if legacy PIC code contains routines that use C extensions or are written in assembly, moving that code to the PIC32 can be extremely problematic; it might have to be rewritten from scratch. Likewise, when migrating from a PIC32 to a PIC24 or dsPIC with separate memory regions, it may be difficult to avoid using C extensions or hand-crafted code to use the memories efficiently. Again, the compilation technology used to generate object code plays an important role here.
Omniscient code generation solves this problem as well, because its compile-time analysis of the whole program lets it make optimal decisions about memory placement, pointer scoping, and so forth, without any intervention from the programmer. It optimizes memory resources while eliminating any need for assembly code or C-language extensions. This means legacy C code written for the PIC24 or dsPIC remains straight ANSI C and can be recompiled for the PIC32 with minimal changes. Conversely, code written for the PIC32 can be reused on 8-/16-bit PIC MCUs using an OCG compiler.
Jeffrey O’Keefe is director of education and development at HI-TECH Software. He joined HI-TECH in 1997 as senior software engineer. In 1998 he received his PhD from La Trobe University, Australia, in digital signal processing and has undergraduate degrees in physics, mathematics, and electronics. You may reach him at firstname.lastname@example.org.
Matthew Luckman is chief technical officer of HI-TECH Software. He joined HI-TECH Software in 1998 and has a degree in information technology from Queensland University of Technology, Australia. You may reach him at email@example.com.