
AltiVec Engaged

James Klicman

May 2002

Introduction

After laying my hands on the PowerBook G4 Titanium on which I'm now typing, I knew that someday I would write code for its unique AltiVec unit, a.k.a. the Velocity Engine. It started out simply as an attempt to import code found at motorola.com into the MJPEG tools MPEG-1/2 video encoder. That turned out to be just the beginning.

I have now spent quite a bit of spare time optimizing MJPEG's encoder. Currently, almost every function that was optimized for MMX/SSE or 3DNow! has also been optimized for AltiVec. Two functions, iquant_non_intra_m1 and select_dct_type, are not AltiVec optimized. They are certainly candidates for optimization; time constraints and a desire to release the software are the reasons for their absence. Both functions consume a small part of the overall execution time, so their lack of optimization does not put the AltiVec encoder at much of a disadvantage.

One function, subsample_image, is AltiVec optimized but not MMX/SSE or 3DNow! optimized. This function is only called once per frame. It was optimized to offset the additional calculations introduced into the original code by padding the frames to ensure 16-byte alignment.
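To make the alignment padding concrete, here is a minimal sketch of rounding a row width up to a 16-byte boundary. The helper names pad16 and is_aligned16 are hypothetical and are not taken from the MJPEG tools source; the widths in the note below are only illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Round n up to the next multiple of 16 (hypothetical helper). */
static size_t pad16(size_t n)
{
    return (n + 15) & ~(size_t)15;
}

/* Nonzero when p lies on a 16-byte boundary, as AltiVec loads expect. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}
```

For example, a 352-pixel VCD row is already a multiple of 16, and so is its 2:1 subsampled width of 176, but the 4:1 subsampled width of 88 is not: pad16(88) gives 96, so each such row would carry 8 extra bytes.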

Some of the functions can certainly be made faster. I wrote many of them before I had access to SIM_G4. The knowledge I gained by studying SIM_G4's simulation reports helped me make significant improvements to already optimized code. Luckily, SIM_G4 and other tools are now freely available through Apple.

Motion Search
                          AltiVec   MMX/SSE
bsad                      X, *      X, *
bsumsq                    X         X
bsumsq_sub22              X         X
build_sub22_mests         X         X
build_sub44_mests         X         X
find_best_one_pel         X         X
mblocks_sub44_mests       *         X
sad_00                    X         X
sad_01                    X         X
sad_10                    X         X
sad_11                    X         X
sad_sub22                 *         X
sad_sub44                 *         X
subsample_image           X         -
sumsq                     X         X
sumsq_sub22               X         X
variance                  X         X

Quantization
                          AltiVec   MMX/SSE
iquant_non_intra_m1       -         X
quant_non_intra           X         X
quant_weight_coeff_sum    X         X

Prediction
                          AltiVec   MMX/SSE
pred_comp                 X         X

Transformation
                          AltiVec   MMX/SSE
fdct                      X         X
idct                      X         X
add_pred                  X         X
sub_pred                  X         X
select_dct_type           -         X

X       Optimized
X, *    Optimized, but not called.
-       No optimization
*       Not called, does not affect performance.

 

Function Benchmarks

During development, I integrated two features that have been extremely helpful: verification and benchmarking. Verification helped ensure that each AltiVec function behaves exactly like its C counterpart, or differs only by expected deviations. Benchmarking, of course, has helped guide performance in the right direction. Both features operate at the function level: verification checks the results of each function call, and benchmarking runs at specified intervals. The benchmarking feature was implemented before SIM_G4 was freely available. While SIM_G4 gives more accurate and detailed performance data, I still find the benchmarking feature useful.
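The general shape of function-level verification can be sketched as follows. This is an illustration only, not the code in MJPEG tools: the block size, the block_fn signature, the tolerance parameter and the three sample functions are all assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

#define BLOCK_COEFFS 64  /* assumed: one 8x8 block of coefficients */

typedef void (*block_fn)(short *block);

/* Hypothetical stand-ins for a C reference and two optimized candidates. */
static void double_by_mul(short *b) { for (int i = 0; i < BLOCK_COEFFS; i++) b[i] = (short)(b[i] * 2); }
static void double_by_add(short *b) { for (int i = 0; i < BLOCK_COEFFS; i++) b[i] = (short)(b[i] + b[i]); }
static void double_off_by_one(short *b) { double_by_add(b); b[0] = (short)(b[0] + 1); }

/* Run both implementations on copies of the same input and count the
 * outputs that differ by more than `tolerance`. */
static int verify_block_fn(const char *name, block_fn reference,
                           block_fn candidate, const short *input,
                           int tolerance)
{
    short ref[BLOCK_COEFFS], out[BLOCK_COEFFS];
    int mismatches = 0;

    memcpy(ref, input, sizeof ref);
    memcpy(out, input, sizeof out);
    reference(ref);
    candidate(out);

    for (int i = 0; i < BLOCK_COEFFS; i++) {
        int diff = ref[i] - out[i];
        if (diff < 0)
            diff = -diff;
        if (diff > tolerance) {
            fprintf(stderr, "%s: [%d] expected %d, got %d\n",
                    name, i, ref[i], out[i]);
            mismatches++;
        }
    }
    return mismatches;
}
```

A tolerance of 0 demands exact agreement with the reference; a nonzero tolerance is one simple way to model the "expected deviations" case.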

Each benchmarked function has a competitor that it is timed against. The competitor can be the original C function or another AltiVec function. During benchmark initialization, a calibration phase finds the minimum number of iterations each competitor requires for accurate timing, and the larger of the two counts is used. At specified intervals, both functions are timed running the calibrated number of iterations. Due to operating system overhead and other running processes, it is difficult to benchmark closely matched functions: times can vary by a few percent in either direction, even when a function competes against itself. Sampling at regular intervals gives much more accurate results. The sampling also shows how performance varies with function input, which can help pinpoint and improve worst-case performance.
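The calibrate-then-sample scheme described above might look roughly like this. It is a sketch under assumptions, not the benchmarking code from MJPEG tools: it uses the standard clock() timer, doubles the iteration count until a run lasts long enough to time, and uses two made-up workloads in place of real encoder functions.

```c
#include <time.h>

typedef void (*bench_fn)(void);

/* Hypothetical stand-in workloads; a real run would call the C and AltiVec
 * versions of the same encoder function on identical input. */
static volatile unsigned sink;
static void work_baseline(void)  { for (int i = 0; i < 400; i++) sink += i; }
static void work_candidate(void) { for (int i = 0; i < 100; i++) sink += i; }

/* CPU seconds taken by `iters` back-to-back calls of fn. */
static double time_iterations(bench_fn fn, unsigned long iters)
{
    clock_t start = clock();
    for (unsigned long i = 0; i < iters; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Double the iteration count until one timed run lasts min_seconds. */
static unsigned long calibrate(bench_fn fn, double min_seconds)
{
    unsigned long iters = 1;
    while (iters < (1UL << 30) && time_iterations(fn, iters) < min_seconds)
        iters *= 2;
    return iters;
}

/* Both competitors run the larger of the two calibrated counts. */
static unsigned long calibrate_pair(bench_fn a, bench_fn b, double min_seconds)
{
    unsigned long ia = calibrate(a, min_seconds);
    unsigned long ib = calibrate(b, min_seconds);
    return ia > ib ? ia : ib;
}

/* One sample: a ratio above 1 means the candidate was faster this time. */
static double speedup_sample(bench_fn baseline, bench_fn candidate,
                             unsigned long iters)
{
    double tb = time_iterations(baseline, iters);
    double tc = time_iterations(candidate, iters);
    return tc > 0.0 ? tb / tc : 0.0;
}
```

Calling speedup_sample repeatedly during an encode, rather than once, is what smooths out the few-percent noise from the operating system and exposes input-dependent slow cases.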

AltiVec Performance Increases by Function
function bar chart
1 = normal performance    2 = 2X greater performance

 

Encoder Benchmarks

The two machines used for the encoder benchmarks were not chosen to fuel any PowerPC vs. x86 debate; they are simply the machines I currently use. Of course, I can't say I'm unbiased. Since I wrote the AltiVec code, I would like to see the G4 outperform the Athlon. :-) It is no coincidence that the benchmarks follow my wishes: I improved the AltiVec code until that was the case. If I had an 800MHz Athlon, it's possible that these charts would show a 500MHz G4 outperforming an 800MHz Athlon. In any case, the machines are fairly evenly matched and do provide insight into the relative speed increases achievable with AltiVec, MMX/SSE and 3DNow!.

There are six encoders used in the benchmarks, each with different options enabled or compiler flags used. I found that there was a significant difference in performance between executables compiled with the -mcpu=i586 and -mcpu=i686 flags on the Athlon. This led me to experiment with the -mcpu option for the G4, which resulted in only a minor performance improvement using -mcpu=750. There could be a couple of reasons for the smaller effect on the G4: either the optimizer for PowerPC is not on par with the optimizer for x86, or code compiled for one PowerPC processor runs well on all PowerPC processors, a characteristic that appears untrue for x86.

The video clip used in the benchmarks is from an NTSC Mini-DV source. A short 11-second (330 frames) clip was chosen; it's a zoom shot of a parrot, in case you're curious. The clip was scaled down for VCD and SVCD. De-interlaced copies of the SVCD and DVD clips were also made. Each video clip was then exported as uncompressed YUV data in MJPEG's native format. The largest file, 163MB for the DVD video, was small enough to be cached in RAM on both machines.
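As a sanity check on the 163MB figure, the arithmetic can be sketched as below. The frame size of 720x480 and the planar 4:2:0 layout are assumptions (the usual NTSC DVD values), stream and frame headers are ignored, and "MB" is taken to mean 2^20 bytes. The function name is hypothetical.

```c
/* Bytes in one planar 4:2:0 frame: a full-resolution luma plane plus two
 * chroma planes at half resolution in each direction. */
static unsigned long yuv420_frame_bytes(unsigned long width,
                                        unsigned long height)
{
    return width * height + 2UL * ((width / 2) * (height / 2));
}
```

Under those assumptions one frame is 518,400 bytes, and 330 frames come to 171,072,000 bytes, or about 163MB once divided by 1,048,576, which agrees with the file size quoted above.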

Each encoder was run 5 times for each combination of clip and encoding parameters. The best 3 of the 5 times were then averaged to produce the final time. The encoding parameters common to each run were "-v 0 -a 2 -F 4 -n n -M 0 -d -o /dev/null".
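The best-3-of-5 averaging is simple enough to show directly. The sketch below is an illustration with a hypothetical function name; it treats the smallest times as the best runs.

```c
#include <stdlib.h>

/* qsort comparator: ascending order of doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Mean of the `keep` smallest values among n timings.
 * Sorts `times` in place; returns -1.0 if keep is 0 or larger than n. */
static double best_mean(double *times, size_t n, size_t keep)
{
    double sum = 0.0;

    if (keep == 0 || keep > n)
        return -1.0;
    qsort(times, n, sizeof *times, cmp_double);
    for (size_t i = 0; i < keep; i++)
        sum += times[i];
    return sum / (double)keep;
}
```

With made-up run times of 12, 10, 15, 11 and 30 seconds, best_mean(times, 5, 3) averages 10, 11 and 12, so a single slow outlier such as the 30-second run does not affect the reported time.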

 

VCD Encoding, Default Settings
-f 1 -I 0 parrot-vcd.yuv
vcd encoding chart
time in seconds

 

VCD Encoding FPS, Default Settings
vcd fps chart
frames per second

 

VCD Encoding, High-Quality Settings
-f 1 -I 0 -r 32 -4 1 -2 1 parrot-vcd.yuv
vcd hq encoding chart
time in seconds

 

VCD Encoding FPS, High-Quality Settings
vcd hq fps chart
frames per second

 

I originally intended to have charts for each format: VCD, SVCD, SVCD-Interlaced, DVD and DVD-Interlaced. However, each encoder scaled quite predictably and the charts were very monotonous. Instead, the following line chart shows how the two fastest encoders scale for each format.

 

Video Encoding Times by Format, Default Settings
video encoding timeline chart

 

Conclusions

The AltiVec instruction set architecture (ISA) is very well designed and a pleasure to program for. The ability to write AltiVec optimized software in C saves time and avoids the tedium associated with assembly language programming. AltiVec enables G4 processors to compare favorably with x86 processors having significantly higher clock rates.
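To give a feel for what AltiVec code written in C looks like, here is a small sketch of a sum of absolute differences over a 16-pixel-wide block, next to a scalar reference. It is not the sad_00 routine from MJPEG tools; the function names are hypothetical, and the AltiVec path assumes both pointers are 16-byte aligned and the stride is a multiple of 16. That path is compiled only when the compiler defines __ALTIVEC__ (for example GCC with -maltivec); elsewhere the scalar version is used.

```c
#include <stdlib.h>
#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* Scalar reference: sum of |a - b| over `rows` rows of 16 pixels. */
static int sad16_c(const unsigned char *a, const unsigned char *b,
                   int stride, int rows)
{
    int sum = 0;
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(a[x] - b[x]);
        a += stride;
        b += stride;
    }
    return sum;
}

#ifdef __ALTIVEC__
/* AltiVec sketch: one 16-byte load per row from each block.
 * Assumes a and b are 16-byte aligned and stride is a multiple of 16. */
static int sad16_altivec(const unsigned char *a, const unsigned char *b,
                         int stride, int rows)
{
    vector unsigned int acc = vec_splat_u32(0);
    union { vector unsigned int v; unsigned int s[4]; } u;

    for (int y = 0; y < rows; y++) {
        vector unsigned char va = vec_ld(0, a);
        vector unsigned char vb = vec_ld(0, b);
        /* Unsigned absolute difference: max(a,b) - min(a,b). */
        vector unsigned char d = vec_sub(vec_max(va, vb), vec_min(va, vb));
        /* Add each group of four bytes into the matching 32-bit lane. */
        acc = vec_sum4s(d, acc);
        a += stride;
        b += stride;
    }
    /* Add the four partial sums in scalar code to keep the sketch simple. */
    u.v = acc;
    return (int)(u.s[0] + u.s[1] + u.s[2] + u.s[3]);
}
#endif

/* Pick the AltiVec version when the compiler supports it. */
static int sad16(const unsigned char *a, const unsigned char *b,
                 int stride, int rows)
{
#ifdef __ALTIVEC__
    return sad16_altivec(a, b, stride, rows);
#else
    return sad16_c(a, b, stride, rows);
#endif
}
```

The inner loop of the AltiVec version handles 16 pixels per row with a handful of operations, where the scalar version needs 16 subtract, absolute value and add steps.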

 

Acknowledgments

I would like to thank Andrew Stevens, who contributed significantly to the MMX/SSE and 3DNow! code for MJPEG tools. Andrew's helpful and unintentionally inspiring emails were a motivating factor in my effort, and his code helped set the bar for performance. I would also like to thank Karsten Jeppesen of Total Impact for network access to the G3 and G4 briQ servers used during testing. Finally, I would like to thank Ian Ollmann for his AltiVec tutorial and his knowledgeable answers on the altivec_forum email list.

 

© 2002 James Klicman

 
