After laying my hands on the PowerBook G4 Titanium from which I'm now typing, I knew that someday I would write code for its unique AltiVec unit, a.k.a. the Velocity Engine. It started out simply as an attempt to import code found at motorola.com into the MJPEG tools MPEG-1/2 video encoder. That turned out to be just the beginning.
I have now spent quite a bit of spare time optimizing MJPEG's encoder. Currently, almost every function that was optimized for MMX/SSE or 3DNow! has also been optimized for AltiVec. Two functions, iquant_non_intra_m1 and select_dct_type, are not yet AltiVec optimized. They are certainly candidates for optimization; they are absent only because of time constraints and a desire to release the software. Both consume a small part of the overall execution time, so their lack of optimization does not put the AltiVec encoder at much of a disadvantage.
One function, subsample_image, is AltiVec optimized but not MMX/SSE or 3DNow! optimized. This function is only called once per frame. It was optimized to offset the additional calculations that result from padding the frames to ensure 16-byte alignment, which the original code did not do.
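As a rough illustration of what 16-byte padding costs, the sketch below rounds a row width up to the next multiple of 16, the size in bytes of one AltiVec vector. The helper names are hypothetical and are not taken from MJPEG tools; the actual padding scheme in the encoder may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from MJPEG tools): round n up to the next
 * multiple of 16, the size in bytes of one AltiVec vector. */
static size_t pad_to_16(size_t n)
{
    return (n + 15u) & ~(size_t)15u;
}

/* Number of extra bytes added to a row of the given width. */
static size_t row_padding(size_t width)
{
    return pad_to_16(width) - width;
}
```

A width that is already a multiple of 16, such as 352, needs no padding, while a width of 360 would grow to 368. Any padded bytes are extra data that per-frame functions then have to process.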
Some of the functions can certainly be made faster. I wrote many of them before I had access to SIM_G4. The knowledge I gained by studying SIM_G4's simulation reports helped me make significant improvements to already optimized code. Luckily, SIM_G4 and other tools are now freely available through Apple.
During development, I integrated two features which have been extremely helpful: verification and benchmarking. Verification helped ensure that each AltiVec function behaves exactly as its C counterpart does, or differs only by expected deviations. Benchmarking, of course, has helped guide performance in the right direction. Both features operate at the function level: verification checks the results of each function call, and benchmarking runs at specified intervals. The benchmarking feature was implemented before SIM_G4 was freely available. While SIM_G4 gives more accurate and detailed performance data, I still find the benchmarking feature useful.
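The idea behind function-level verification can be sketched in plain C as follows. This is only an illustration under assumptions: the function names, the simplified signature, and the choice of a sum of absolute differences are hypothetical, and `sad_candidate` is an unrolled scalar stand-in for an AltiVec routine so that the example compiles on any machine. The `tolerance` parameter models the "expected deviations" mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Signature shared by the reference and the optimized version
 * (hypothetical; real encoder functions take more parameters). */
typedef int (*sad_fn)(const uint8_t *a, const uint8_t *b, int n);

/* Reference implementation: plain C sum of absolute differences. */
static int sad_reference(const uint8_t *a, const uint8_t *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Stand-in for an optimized version: handles four elements per loop
 * iteration, then finishes the remainder one element at a time. */
static int sad_candidate(const uint8_t *a, const uint8_t *b, int n)
{
    int sum = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += abs((int)a[i] - (int)b[i]);
        sum += abs((int)a[i + 1] - (int)b[i + 1]);
        sum += abs((int)a[i + 2] - (int)b[i + 2]);
        sum += abs((int)a[i + 3] - (int)b[i + 3]);
    }
    for (; i < n; i++)
        sum += abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Run both versions on the same input and compare the results.
 * Returns 1 if they agree within tolerance, 0 otherwise. */
static int verify_sad(sad_fn ref, sad_fn cand,
                      const uint8_t *a, const uint8_t *b, int n,
                      int tolerance)
{
    int expected = ref(a, b, n);
    int actual = cand(a, b, n);
    if (abs(expected - actual) > tolerance) {
        fprintf(stderr, "verify: mismatch, reference=%d candidate=%d\n",
                expected, actual);
        return 0;
    }
    return 1;
}
```

Calling `verify_sad(sad_reference, sad_candidate, a, b, n, 0)` on every invocation demands an exact match; a non-zero tolerance would suit functions where small rounding differences are acceptable.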
Each benchmarked function has a competitor which it is timed against. The competitor can be the original C function or another AltiVec function. During benchmark initialization, a calibration phase finds the minimum number of iterations required for accurate timing, and the larger of the two competitors' iteration counts is chosen. At specified intervals, both functions are timed running the calibrated number of iterations. Due to operating system overhead and other running processes, it is more difficult to benchmark closely matched functions. Times can vary by a few percent in either direction, even when a function competes against itself. Sampling at regular intervals gives much better accuracy. The sampling also shows how performance varies with function input, which can help pinpoint and improve worst-case performance.
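The calibrate-then-sample scheme described above might look something like the following in portable C. It is a minimal sketch, not the encoder's actual benchmarking code: all names are hypothetical, the two competitors are dummy workloads, timing uses the standard `clock()` function, and calibration simply doubles the iteration count until one timing run lasts at least `min_seconds`, which only approximates the true minimum.

```c
#include <assert.h>
#include <time.h>

/* A benchmarked function, reduced to the simplest possible signature. */
typedef void (*bench_fn)(void);

/* Volatile sink so the compiler cannot discard the dummy workloads. */
static volatile unsigned long sink;

/* Two stand-in competitors that compute the same sum of squares. */
static void competitor_a(void)
{
    unsigned long s = 0;
    for (unsigned long i = 0; i < 1000; i++)
        s += i * i;
    sink = s;
}

static void competitor_b(void)
{
    unsigned long s = 0;
    for (unsigned long i = 0; i < 1000; i += 2)
        s += i * i + (i + 1) * (i + 1);
    sink = s;
}

/* CPU seconds spent calling fn the given number of times. */
static double time_iterations(bench_fn fn, unsigned long iterations)
{
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Double the iteration count until a run lasts at least min_seconds. */
static unsigned long calibrate(bench_fn fn, double min_seconds)
{
    unsigned long iterations = 1;
    while (iterations < (1UL << 30) &&
           time_iterations(fn, iterations) < min_seconds)
        iterations *= 2;
    return iterations;
}

/* Calibrate both competitors and keep the larger iteration count. */
static unsigned long calibrate_pair(bench_fn a, bench_fn b,
                                    double min_seconds)
{
    unsigned long na = calibrate(a, min_seconds);
    unsigned long nb = calibrate(b, min_seconds);
    return na > nb ? na : nb;
}

/* One sample: time of a divided by time of b (0 if b was too fast). */
static double sample_ratio(bench_fn a, bench_fn b,
                           unsigned long iterations)
{
    double ta = time_iterations(a, iterations);
    double tb = time_iterations(b, iterations);
    return tb > 0.0 ? ta / tb : 0.0;
}
```

A caller would run `calibrate_pair` once at start-up and then call `sample_ratio` at each interval, averaging the samples to smooth out the few-percent noise that a single measurement carries.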
|AltiVec Performance Increases by Function|
|1 = normal performance 2 = 2X greater performance|
The two machines used for the encoder benchmarks are not contrived to fuel any PowerPC vs. x86 debate; they are simply machines that I currently use. Of course, I can't say I'm unbiased. Since I wrote the AltiVec code, I would like to see the G4 outperform the Athlon. :-) It is no coincidence that the benchmarks follow my wishes; I kept improving the AltiVec code until such was the case. If I had an 800MHz Athlon, it's possible that these charts would show a 500MHz G4 outperforming an 800MHz Athlon. In any case, the machines are fairly evenly matched and do provide insight into the relative speed increases achievable with AltiVec, MMX/SSE and 3DNow!.
There are six encoders used in the benchmarks, each with different options enabled or compiler flags used. I found that there was a significant difference in performance between executables compiled with the -mcpu=i586 and -mcpu=i686 flags on the Athlon. This led me to experiment with the -mcpu option for the G4, which resulted in a minor performance improvement using -mcpu=750. There could be a couple of reasons for the smaller gain: either the optimizer for PowerPC is not on par with the optimizer for x86, or code compiled for one PowerPC processor runs well on all PowerPC processors, a characteristic that appears untrue for x86.
The video clip used in the benchmarks is from an NTSC Mini-DV source. A short 11-second (330 frames) clip was chosen; it's a zoom shot of a parrot, in case you're curious. The clip was scaled down for VCD and SVCD. De-interlaced copies of the SVCD and DVD clips were also made. Each video clip was then exported as uncompressed YUV data in MJPEG's native format. The largest file, 163MB for the DVD video, was small enough to be cached in RAM on both machines.
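The 163MB figure can be sanity-checked with a little arithmetic. The sketch below assumes that the DVD clip is 720x480 and stored as planar YUV 4:2:0; neither detail is stated above, and per-frame headers are ignored. The function names are hypothetical.

```c
#include <assert.h>

/* Bytes in one planar YUV 4:2:0 frame: a full-resolution luma plane
 * plus two chroma planes, each at half width and half height. */
static unsigned long yuv420_frame_bytes(unsigned long width,
                                        unsigned long height)
{
    return width * height + 2 * ((width / 2) * (height / 2));
}

/* Bytes in an uncompressed clip, ignoring any stream or frame headers. */
static unsigned long clip_bytes(unsigned long width, unsigned long height,
                                unsigned long frames)
{
    return yuv420_frame_bytes(width, height) * frames;
}
```

Under those assumptions, `clip_bytes(720, 480, 330)` gives 171,072,000 bytes, or roughly 163 when divided by 1024 x 1024, which is consistent with the file size quoted above.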
Each encoder was run 5 times for each clip and set of encoding parameters. The best 3 times of the 5 runs were then averaged to produce the final time. The encoding parameters common to each run were "-v 0 -a 2 -F 4 -n n -M 0 -d -o /dev/null".
|VCD Encoding, Default Settings|
|time in seconds|
|VCD Encoding FPS, Default Settings|
|frames per second|
|VCD Encoding, High-Quality Settings|
|time in seconds|
|VCD Encoding FPS, High-Quality Settings|
|frames per second|
I originally intended to have charts for each format: VCD, SVCD, SVCD-Interlaced, DVD and DVD-Interlaced. However, each encoder scaled quite predictably and the charts were very monotonous. Instead, the following line chart shows how the two fastest encoders scale for each format.
|Video Encoding Times by Format, Default Settings|
The AltiVec instruction set architecture (ISA) is very well designed and a pleasure to program for. The ability to write AltiVec optimized software in C saves time and avoids the tedium associated with assembly language programming. AltiVec enables G4 processors to compare favorably with x86 processors having significantly higher clock rates.
I would like to thank Andrew Stevens who contributed significantly to the MMX/SSE and 3DNow! code for MJPEG tools. Andrew's helpful and unintentionally inspiring emails were a motivating factor in my effort and his code helped set the bar for performance. I would also like to thank Karsten Jeppesen of Total Impact for network access to G3 and G4 briQ servers used during testing. Finally, I would like to thank Ian Ollmann for his AltiVec tutorial and his knowledgeable answers on the altivec_forum email list.