The PowerPC G5 processor is at the heart of Apple's latest computer models. Beyond the obvious increase in clock frequency from the previous generation of G4 CPUs, a number of significant changes have been made to the core CPU and system architecture that can affect the way programs run on these systems. This document is a high-level summary of the more conspicuous features to keep in mind while (re)targeting your programs to run on the G5.
[Sep 03, 2003]
Quick Comparison: G4 & G5
Tables 1 through 3 provide a quick reference for comparing various features found in the G4 and G5 processors. A more detailed summary of these differences can be found at the beginning of the section that follows.
Table 1. Core features comparison.

| Feature                 | G4                   | G5           |
|-------------------------|----------------------|--------------|
| Bits                    | 32                   | 64           |
| Clock speed (GHz)       | 0.55 - 1.42          | 1.6 - 2.0    |
| Instructions per clock  | 3 + 1 branch         | 4 + 1 branch |
| Load/store units        | 1                    | 2            |
| Integer units           | 3 simple + 1 complex | 2            |
| Floating-point units    | 1                    | 2            |
| Vector units            | 1                    | 1            |
Table 2. Caches comparison.

| Feature                  | G4                                 | G5                                    |
|--------------------------|------------------------------------|---------------------------------------|
| Cache line width (bytes) | 32                                 | 128                                   |
| L1 instruction cache     | 32K, 8-way associative             | 64K, direct-mapped                    |
| L1 data cache            | 32K, write-back, 8-way associative | 32K, write-through, 2-way associative |
| L2 cache (KB)            | 256                                | 512                                   |
| L3 cache (MB)            | 2                                  | 0                                     |
Table 3. Memory subsystem comparison.

| Feature                   | G4   | G5         |
|---------------------------|------|------------|
| Data bus width (bits)     | 64   | 128        |
| Address bus width (bits)  | 36   | 42         |
| Bus speed (MHz)           | 167  | 800 - 1000 |
| Bus bandwidth (GB/sec)    | 1.3  | 8.0        |
| Memory bandwidth (GB/sec) | 2.7  | 6.4        |
| Latency (ns)              | 93   | 135        |
| Addressable memory (GB)   | 2    | 16         |
Summary of Differences Between the G4 and G5
Compared to the G4, the G5 differs in the following ways:
- Core:
- The G5 has a massive out-of-order execution engine, able to keep >200 instructions in flight vs. >30 for the G4.
- Two double-precision floating point units vs. one for the G4.
- Two load/store units vs. one for the G4.
- Support for 64-bit integer arithmetic vs. 32-bit for the G4.
- Implements FP square root as HW instruction vs. software function for the G4.
- Instructions are tracked in "groups" from dispatch to completion.
- Complex instructions are "cracked" or implemented as microcode.
- New forms of mtcrf and mfcr -- the old forms are implemented as microcode.
- Velocity Engine single-element loads are implemented as lvx, leaving the other fields undefined; previous processors zeroed the undefined fields.
- Velocity Engine inter-element shifts require the shift count to be replicated in every element of the VR, whereas previous processors used the rightmost element's shift count for every shift.
- A much longer execution pipeline (up to 23 stages vs. 7 stages for the G4).
- Two integer units vs. four (3 simple + 1 complex) for the G4. The G5's two integer units are more capable than the G4's simple integer units: both can handle multiply, and one can also divide, whereas on the G4 only the complex unit can multiply or divide.
- Branch mispredicts are more costly because of the deeper pipelines.
- Misaligned load/stores to uncached memory always take alignment exceptions.
- Caches:
- Larger L2 cache (512K vs. 256K), but no L3 cache.
- 128 Byte cache lines vs. 32 Bytes for the G4.
- L1 data cache is 32K, write-through, 2-way associative vs. 32K, write-back, 8-way associative for the G4.
- No L1 allocation on store misses.
- L1 instruction cache is 64K direct-mapped vs. 32K 8-way associative for the G4.
- Memory subsystem:
- Vastly increased system memory bandwidth.
- Improved hardware prefetch mechanism that is self-starting for established sequential access patterns.
- Larger addressable memory space (up to 16 GB with U3 memory controller vs. 2 GB for the G4 with U2 memory controller).
- Increased memory latency -- 135 ns best case vs. 93 ns best case for the G4.
Performance Do's for the G5
Take advantage of the additional double-precision FPU
The G5 has two complete double-precision floating-point units, each offering better performance than the single floating-point unit in the G4. Software can treat the two scalar FPUs as a 2-way double-precision vector unit. To make the best use of the additional FPU, schedule your code so that dependencies are minimized (via loop unrolling, software pipelining, etc.) and no single FPU becomes the bottleneck. Write your floating-point code so that it can run on both FPUs simultaneously: each unit has a 6-cycle execution latency, so software should attempt to fill 12 pipeline slots. It may be simpler to treat the CPU as having a single FPU with a 12-cycle execution latency.
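As a sketch of what this scheduling looks like in practice, the loop below unrolls a dot product into two independent dependency chains so that both FPUs can retire work in parallel. The function name and the two-accumulator split are illustrative, not a prescribed API:

```c
#include <stddef.h>

/* Dot product unrolled with two independent accumulators so that the
 * two FPUs each get their own dependency chain. */
double dot_product2(const double *a, const double *b, size_t n)
{
    double sum0 = 0.0, sum1 = 0.0;
    size_t i;

    /* Two independent chains -- one per FPU. */
    for (i = 0; i + 1 < n; i += 2) {
        sum0 += a[i]     * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                      /* odd tail element */
        sum0 += a[i] * b[i];

    return sum0 + sum1;
}
```

The two partial sums have no dependency on each other, so the out-of-order engine can issue one multiply-add chain to each FPU; merging them happens once, outside the loop.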
Take advantage of the hardware prefetch engine
The G5 contains a self-starting prefetch engine, capable of automatically prefetching data along four different streams. The prefetches begin without user intervention: if a pattern of two or more load misses with a sequential cache line stride is detected, the prefetch engine begins to prefetch sequential cache lines into the L1 and L2 caches following the established pattern. The prefetch engine is paced by demand misses and will continue until a page boundary (4K) is reached. Up to four unique prefetch streams can be active at once. Note that the G5 does not prefetch on store misses. Prefetches can also be initiated using the DCBTL instruction. DCBTL operates much like DCBT, except that multiple cache lines are prefetched starting at a given address, and the direction of the prefetch stream can be specified (up or down). DCBTL avoids the startup cost of the automatic stream detection used by the hardware prefetcher in cases where the programmer knows the data usage pattern in advance. Unlike DST, DCBTL-initiated prefetches cannot be stopped via software, though they share the same constraints as the hardware prefetch engine.
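DCBT and DCBTL are PowerPC instructions, but from C the same software-directed prefetch can be expressed with GCC's __builtin_prefetch, which GCC lowers to a data-cache touch on PowerPC targets. A minimal sketch, assuming a 128-byte line and an illustrative look-ahead distance of four lines:

```c
#include <stddef.h>

#define G5_CACHE_LINE 128   /* G5 cache line size in bytes */

/* Sum an array while software-prefetching a few cache lines ahead of
 * the current position. */
double prefetch_sum(const double *data, size_t n)
{
    const size_t ahead = 4 * G5_CACHE_LINE / sizeof(double);
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            __builtin_prefetch(&data[i + ahead], 0 /* read */, 1);
        sum += data[i];
    }
    return sum;
}
```

The look-ahead distance is a tuning parameter: too small and the data arrives late; too large and prefetched lines may be evicted before they are used.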
Take advantage of the two load/store units
The additional load/store unit on the G5 can allow more memory accesses to be processed per cycle than the G4. Combine this with the higher bandwidth available to the processor and you end up with a compute engine capable of consuming enormous amounts of data. Reworking your code to take advantage of both available load/store units (while being careful not to make them dependent upon each other) can greatly improve the performance of your code.
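A minimal sketch of this idea: the unrolled loop below issues loads from two independent address streams per iteration, giving each load/store unit its own work. The function name is illustrative:

```c
#include <stddef.h>

/* Element-wise add of two arrays, unrolled so that consecutive
 * iterations touch independent addresses; the loads for iteration
 * i+1 can proceed while iteration i is still completing. */
void vec_add2(double *dst, const double *a, const double *b, size_t n)
{
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        double x0 = a[i],     y0 = b[i];      /* one stream */
        double x1 = a[i + 1], y1 = b[i + 1];  /* an independent stream */
        dst[i]     = x0 + y0;
        dst[i + 1] = x1 + y1;
    }
    if (i < n)                                /* odd tail element */
        dst[i] = a[i] + b[i];
}
```

The key point is independence: a recurrence such as dst[i] = dst[i-1] + a[i] would serialize the two units and forfeit the benefit.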
Take advantage of the full precision hardware square root
The G5 has a full-precision hardware square root implementation. If your code computes square roots, check for the availability of hardware square root on the G5 and call the instruction directly (e.g. __fsqrt()) instead of calling the sqrt() routine. (Use __fsqrts() for single precision.) You can use the GCC compiler flags -mpowerpc-gpopt and -mpowerpc64 to transform sqrt() function calls directly into the PPC fsqrt instruction.
Align hot code for maximum instruction dispatch
Alignment of code in memory is a more significant factor for performance on the G5 than on the G4. The G5 fetches instructions in aligned 32 byte blocks. Therefore, it is often profitable to align hot loops, branches, or branch targets to fetch boundaries. GCC 3.3 offers the following flags to align code: -falign-functions=32 , -falign-labels=32 , -falign-loops=32 , -falign-jumps=32 . Additionally, you may need to specify -falign-[functions, labels, loops, jumps]-max-skip=[15,31] to get the desired alignment.
IMPORTANT:
Extensive use of these alignment flags will substantially increase the size of the compiled executable code generated by GCC.
Performance Don'ts for the G5
Here is a listing of changes to consider for code that was previously optimized for the G4:
Carefully evaluate the use of DST and its derivatives
Data Stream Touch instructions cannot be executed speculatively on the G5, and thus cause execution serialization. These instructions include DST, DSTST, and others. If a DST instruction is encountered, the entire execution engine must be allowed to drain so that the DST is the next instruction to complete before it can be executed. This can introduce large bubbles in program execution time to service a software-directed prefetch that is not guaranteed to be executed completely anyway (by PPC design).

The DST implementation on the G5 does not fetch across 4K page boundaries; if a DST encounters a page boundary, it is terminated. Transient hints for DSTs are ignored, and strides are assumed to be a power of 2. DST should thus be used with great care. In some cases it may still be beneficial; in general, however, the hardware prefetch engine built into the G5 will do a good job prefetching data that is sequential in nature and fits within a page (4K).

DCBT does not incur the execution serialization penalty of DST. The drawback of DCBT is that you may need to issue several of them to cover the space that a single DST would have prefetched. To prefetch contiguous memory blocks, DCBTL may be used in place of multiple DCBTs.
DCBZ semantics
Developers have assumed that DCBZ operates on 32-byte chunks of data because the cache lines in previous Macintosh PowerPC systems have always been 32 bytes in size. However, the cache line size in the G5 is 128 bytes. The DCBZ instruction on the G5 still operates on 32-byte quantities, but it does so in a very inefficient manner: when a DCBZ is encountered, the entire 128-byte cache line containing the requested 32 bytes is fetched from memory. This typically casts out another cache line, which, if dirty, must be written back to memory. Once the 128-byte cache line is loaded, the desired 32 bytes are zeroed. Thus, DCBZ on the G5 has three detrimental effects on the performance of existing programs:
- Loops written using existing DCBZ instructions that stride by 32 bytes will issue redundant DCBZs; up to 75% of the DCBZs will already have their requested data in cache, fetched on behalf of an earlier DCBZ. The subsequent DCBZs are still necessary, however, to zero out their 32-byte chunks of the cache line.
- If the stride between DCBZs is greater than 128 bytes, a great deal of memory bandwidth is wasted because the CPU can only make cache-line-sized requests to memory, and as little as 25% of the data transferred is useful.
- The intent of most code that uses DCBZ is to avoid a store miss to memory. In most cases, the G5 implementation will actually cause a store miss.
The use of DCBZ should thus be assessed with great care. If possible, use the DCBZL instruction instead of DCBZ. DCBZL functions just like DCBZ on the G4, except that it operates on the native cache line size: 32 bytes on the G4, 128 bytes on the G5. To use DCBZL, you must query the OS for the cache line size of the CPU and write code that takes the cache line size into account.
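A sketch of that query-then-use pattern in C. The sysctl name hw.cachelinesize is the Mac OS X query; the fallback constant and the memset stand-in for DCBZL are illustrative assumptions:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __APPLE__
#include <sys/sysctl.h>
#endif

/* Query the CPU's cache line size; fall back to 128 (the G5 line size)
 * when the sysctl is unavailable. */
size_t cache_line_size(void)
{
#ifdef __APPLE__
    uint64_t size = 0;
    size_t len = sizeof(size);
    if (sysctlbyname("hw.cachelinesize", &size, &len, NULL, 0) == 0 && size)
        return (size_t)size;
#endif
    return 128;
}

/* Zero a buffer one cache line at a time -- the access pattern a
 * DCBZL-based zeroing loop would follow on the G5. */
void zero_by_lines(void *buf, size_t bytes)
{
    size_t line = cache_line_size();
    unsigned char *p = buf;
    size_t i;

    for (i = 0; i + line <= bytes; i += line)
        memset(p + i, 0, line);       /* one DCBZL per full line */
    if (i < bytes)
        memset(p + i, 0, bytes - i);  /* partial tail, no DCBZL */
}
```

Because the line size is discovered at run time, the same binary remains correct on a 32-byte-line G4 and a 128-byte-line G5.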
DCBI and DCBA are illegal
The G5 does not support the DCBI or DCBA instructions. Do not use them.
Longer latency instructions
If you have written very tight loops that depend upon the latency of operations on the G4, your code may encounter stalls on the G5. These stalls can be addressed by better code scheduling, loop unrolling and software pipelining.
Longer latencies to memory
As CPU frequency has increased at a much faster rate than memory frequency, the relative time to access memory has increased on the G5 vs. the G4. If at all possible, loops should be unrolled and data should be accessed as early as possible before it is used. Prefetching can be done using DCBT, or using the hardware prefetch engine if the stride is regular and established early.
Another scenario to check for is invariant loads, i.e. loading data that does not change inside a loop. By moving the invariant load outside of the loop, the processor does not need to re-fetch unchanging data from memory during each loop iteration. Removing these unnecessary memory accesses can yield big performance gains. Frequently, use of global variables causes unnecessary memory accesses: the compiler is forced to be very conservative when globals are used because it must assume worst-case aliasing conditions. Thus, rather than keeping values in registers, the compiler will load and store from memory to ensure correctness.
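A minimal before/after sketch of hoisting an invariant global load; the names are illustrative:

```c
#include <stddef.h>

double scale_factor = 2.5;   /* a global the compiler must assume may alias */

/* Slow: the compiler reloads 'scale_factor' every iteration, because a
 * store through 'out' could, as far as it knows, modify the global. */
void scale_slow(double *out, const double *in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * scale_factor;
}

/* Fast: copying the global into a local lets it live in a register for
 * the whole loop -- one load instead of n. */
void scale_fast(double *out, const double *in, size_t n)
{
    const double k = scale_factor;   /* invariant load hoisted */
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * k;
}
```

If out can never point at scale_factor, the two functions compute identical results; the fast version simply promises that to the compiler by reading the global once.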
Velocity Engine issue constraints
Instruction issue to the Velocity Engine units on the G5 is the same as in the 7400/7410 G4: two instructions can be issued to the Velocity Engine units in the same cycle only if one of them is a vector permute. In the 745X G4, these issue constraints were relaxed to allow any two Velocity Engine instructions to be issued each cycle. If your code differentiates between the different Velocity Engine issue schemes, choose the 7400-targeted path for the G5. Of course, your code may still need to be restructured to handle the increased latencies of the G5 Velocity Engine pipeline.
Avoid small data accesses
Due to the increased latency to memory, the longer cache lines, and the nature of the CPU-to-memory bus, small data accesses should be avoided if possible. The entire system architecture has been designed to optimize the transfer of large amounts of data (i.e. to maximize system memory throughput). As a side effect, the cost of handling small accesses can be very high and quite inefficient. If possible, allocate data in large chunks to better amortize the overhead of accessing memory.
Adjust to the smaller cache
High-performance programs that have tuned themselves for the presence of an L3 cache will need to be reworked to fit in the (now larger) 512K L2 cache. The Effective-to-Real Address Translation (ERAT) cache contains 128 entries, enough to map 512K of data, the same size as the L2 cache. Thus, if you optimize your code for the 512K L2, you will maximize the use of the ERAT in the process.
Avoid branch mispredictions
Write as much straight-line code as possible; inline function calls and unroll loops. Assembly programmers can use the new AT branch prediction bits via the ++ and -- suffixes to statically predict highly predictable branches, such as those used for exception checking.
Use fewer locks
Due to the increased number of execution pipeline stages and the increased latency to memory, the time to access and acquire a lock can be up to 2.5 times slower than on the G4. While there is little that can be done to speed up the execution of the locks themselves, reducing the number of locks used in your code can drastically improve its overall performance on the G5.
The resolution of the reservation made by the lwarx instruction is one cache line. Since the cache line size increased from 32 bytes on the G4 to 128 bytes on the G5, synchronization primitives that use lwarx/stwcx. pairs are four times more likely to fail on the G5. In particular, avoid storing multiple mutexes in the same cache line.
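One way to follow that advice in C is to give each lock its own cache line. A sketch using C11 alignas; the 128-byte constant is the G5 line size and the type name is illustrative:

```c
#include <stdalign.h>
#include <pthread.h>

#define G5_CACHE_LINE 128   /* G5 cache line (and lwarx reservation) size */

/* Padding each mutex out to a full cache line keeps two locks from
 * sharing one lwarx reservation granule. */
typedef struct {
    alignas(G5_CACHE_LINE) pthread_mutex_t mutex;
} padded_mutex_t;

/* An array of these places every mutex on its own 128-byte line. */
static padded_mutex_t locks[4] = {
    { PTHREAD_MUTEX_INITIALIZER },
    { PTHREAD_MUTEX_INITIALIZER },
    { PTHREAD_MUTEX_INITIALIZER },
    { PTHREAD_MUTEX_INITIALIZER },
};
```

The alignas on the member forces the struct size up to a multiple of the line size, so adjacent array elements can never share a line.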
Type conversion is costly
Scalar type conversion is never good for performance on PPC because it requires memory accesses. On the G5 the problem is magnified, because a store followed by a load of the same address causes a dispatch group rejection and flush. This bubble can severely hamper high-performance code. If you must convert from float to integer or vice versa in performance-critical code, consider padding the code with nops to separate the load from the previous store. GCC is currently unable to perform this optimization, but may add this feature with the -mtune=power4 or -mcpu=power4 optimization flags in the future.
To get the integer part of a floating-point variable, the floor() function should be used instead of a float to int to float assignment.
Avoid microcode
The G5 core implements several instructions in microcode. These instructions cause a pipeline bubble during decode. The most commonly used microcoded instructions are load and store multiple -- lmw and stmw. These are often generated by the compiler to save space when saving and restoring registers on the stack. You can force GCC to avoid these instructions by specifying -mno-multiple. Indexed and/or algebraic forms of updating loads and stores are also executed as microcode. You can force GCC to avoid these instructions by specifying -mno-update.
IMPORTANT:
Extensive use of these code generation flags will substantially increase the size of the compiled executable code generated by GCC.
How to make your code run best on the G5
Use Shark (available in the CHUD tools package at <http://developer.apple.com/tools/performance/>) to determine where your current code might fall down on the G5. Shark can statically analyze a binary for potentially poor-performing code (like the presence of DCBZs and DSTs), and it can dynamically measure where time is spent in your code during program execution. In addition, Shark models the dispatch group formation and execution unit usage of the G5 core -- these are critical factors for achieving good performance on the G5.
Use GCC 3.3 and experiment with optimization flags
GCC offers a wide range of options. Never guess which options will work best for the G5; always experiment with the different options and use profile results to determine which are best for your code. Details of the available G5 optimization flags are documented in TN2086.
In tuning your program's code to run on the G5, you may end up with code that also runs better on the G4 in the process. Of course, you may still require more than one code path to optimally target the G3/G4/G5, but if performance is a priority, there are not many alternatives. You can use Gestalt, sysctl and/or _cpu_capabilities to determine the features (i.e. Velocity Engine, DCBA, DST, etc.) that are available on the target platform so your code can choose an optimal execution path.