Accelerate Release Notes

In our spare time, however, we did manage to rewrite/reoptimize in excess of 7000 entrypoints for three new architectures (21000 in total), i386, ppc64 and x86_64.

Mac OS X v10.4

Overview

This release note describes additions to Accelerate.framework in MacOS X.4, Tiger, that enhance the SIMD (vector, e.g. AltiVec and SSE3) programming environment and provide a wide diversity of hardware accelerated APIs that may be used for high performance computation from ordinary C and FORTRAN code. Accelerate.framework was introduced in MacOS X.3 as an umbrella framework that contains vecLib.framework and vImage.framework. MacOS X.4 introduces new vImage functionality and an array math library called vForce for high throughput numerical computation. In addition, several new architectures were added in MacOS X.4.0, X.4.4 and X.4.7 to provide support for ppc64, i386 and x86_64. As of MacOS X.4.7 (for Mac Pro), the framework supports all four architectures: 32- and 64-bit code on both PowerPC and Intel.

High Performance Hardware Agnostic Vector Engine Support

Among other things, Accelerate.framework is the fundamental support library for SIMD programming (both AltiVec and SSE/SSE2/SSE3/...) on MacOS X. The vecLibTypes.h header (automatically included when you #include <Accelerate/Accelerate.h>) defines unified 128-bit SIMD data types that work for both AltiVec and Intel's vector architecture. Using these types (e.g. vFloat, vSInt32), it is possible to write a single piece of vector code that compiles and runs on both the PowerPC and Intel vector engines on both 32- and 64-bit architectures using GCC.

scalar type	vector type
float	vFloat
double	vDouble (even on AltiVec vectors)
uint8_t	vUInt8
int8_t	vSInt8
uint16_t	vUInt16
int16_t	vSInt16
uint32_t	vUInt32
int32_t	vSInt32
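As a rough illustration of writing one piece of vector code against the unified types, the sketch below multiplies and adds four floats per step using vFloat. The helper name scale_and_add is hypothetical, and the code assumes the compiler accepts arithmetic operators on vFloat, as GCC and Clang do for their vector types. The typedef in the #else branch is only a stand-in so the sketch also builds where Accelerate is unavailable; it is not part of the framework.

```c
#include <assert.h>
#include <string.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>  /* brings in vecLibTypes.h and vFloat */
#else
/* Assumption for building elsewhere: a stand-in 128-bit type made with the
   GCC/Clang vector extension. It is not part of Accelerate. */
typedef float vFloat __attribute__((vector_size(16)));
#endif

/* Hypothetical helper: y[i] = a * x[i] + y[i], four floats per step. */
void scale_and_add(float a, const float *x, float *y, int n)
{
    const float a4[4] = { a, a, a, a };
    vFloat va;
    int i = 0;

    assert(n >= 0);
    memcpy(&va, a4, sizeof va);          /* splat the scalar into a vector */
    for (; i + 4 <= n; i += 4) {
        vFloat vx, vy;
        memcpy(&vx, x + i, sizeof vx);   /* alignment-safe load */
        memcpy(&vy, y + i, sizeof vy);
        vy = va * vx + vy;               /* element-wise on all four lanes */
        memcpy(y + i, &vy, sizeof vy);   /* alignment-safe store */
    }
    for (; i < n; ++i)                   /* scalar tail for leftover elements */
        y[i] = a * x[i] + y[i];
}
```

On MacOS X the file would be linked with -framework Accelerate. The memcpy loads and stores are a conservative choice that avoids any alignment requirement on the caller's arrays.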

In addition, Accelerate.framework provides hardware agnostic support for processing vector floating point and vector integer data through its vMathLib, vBasicOps and vBigNum components. If you need to take the sine of a vFloat, you can call vsinf( vFloat ) through Accelerate's vMathLib library, regardless of whether the vFloat is actually an AltiVec vector float or an Intel __m128. Likewise, the vBasicOps and vBigNum components provide support for integer and large integer arithmetic using vector variables. Please see Accelerate.framework/vfp.h, vBasicOps.h and vBigNum.h for a complete list.
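A minimal sketch of the vsinf( vFloat ) call described above follows. The helper sine_of_four is hypothetical. Everything in the #else branch is an assumption added so the sketch builds without Accelerate: the vFloat typedef is a stand-in, and the replacement vsinf is a crude Taylor-series approximation that is only reasonable for small arguments. It says nothing about how vMathLib itself is implemented.

```c
#include <assert.h>
#include <string.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>  /* declares vFloat and vsinf() */
#else
/* Stand-ins so the sketch builds elsewhere (assumptions, not Accelerate code). */
typedef float vFloat __attribute__((vector_size(16)));

static vFloat vsinf(vFloat v)
{
    float lane[4];
    memcpy(lane, &v, sizeof lane);
    for (int i = 0; i < 4; ++i) {
        const float x = lane[i], x2 = x * x;
        /* x - x^3/3! + x^5/5! - x^7/7!, adequate only for small |x| */
        lane[i] = x * (1.0f - x2 / 6.0f * (1.0f - x2 / 20.0f * (1.0f - x2 / 42.0f)));
    }
    memcpy(&v, lane, sizeof v);
    return v;
}
#endif

/* Hypothetical helper: sine of exactly four floats with one vector call. */
void sine_of_four(const float in[4], float out[4])
{
    vFloat v;
    assert(in != NULL && out != NULL);
    memcpy(&v, in, sizeof v);    /* pack four scalars into one vector */
    v = vsinf(v);                /* same call on AltiVec and SSE builds */
    memcpy(out, &v, sizeof v);   /* unpack the four results */
}
```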

High Performance Algorithms in Accelerate.framework

To be clear, Accelerate.framework is very much not just for vector code. The vast majority of it is callable from vanilla C/C++/ObjC scalar code. (BLAS, vForce and LAPACK are also callable from FORTRAN.) The bulk of Accelerate.framework is devoted to a collection of facilities covering digital signal processing (vDSP), matrix computations (BLAS) and linear algebra (LAPACK). Together they provide high performance, hardware agnostic support for common arithmetic algorithms from signal processing, linear algebra, image processing and scientific calculation. It is intended to be a one-stop shopping place for high performance code of all kinds.

    Accelerate.framework ----------  vecLib.framework ----------(libvMisc) ----- vForce
        |                   |     |        \             |       \
        +---vImage.framework      vDSP   BLAS     LAPACK     vMathLib    vBasicOps / vBigNum

            All these APIs may be accessed through Accelerate.framework

The APIs in Accelerate.framework are designed to transparently run on the fastest hardware available on the system. If the machine has an AltiVec or SSE vector unit, that will be used. The scalar units will be used on G3 class computers. Where appropriate, some APIs will even transparently distribute the work across multiple processors where available. The Accelerate.framework provides two important advantages to software developers:

vForce

New for MacOS X.4, Tiger, vForce provides high performance math routines for array computation. A chief problem with legacy math library design (such as the plain old sin()/cos()/pow() any C programmer is familiar with) is that the functions only take one piece of data at a time. On a modern vector equipped, pipelined machine, such a design could use as little as 1/24th of the computational power of the hardware floating point units. In order to get close to saturating the available hardware bandwidth, one needs to have potentially dozens of calculations operating concurrently. That just isn't possible with a conventional math library which operates on one number at a time, or even with vMathLib, which only does four operations at a time. vForce brings math library support to long arrays of numbers, so the hardware can operate on dozens of values concurrently. vForce can realize performance levels in excess of an order of magnitude faster than the conventional math library and several times faster than vMathLib. For a complete list of the functions introduced in MacOS X.4, please see the vForce header, vForce.h, to be found in the vecLib subcomponent in the Accelerate.framework.

/System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Headers/vForce.h
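The array-at-a-time calling pattern can be sketched as follows. The wrapper reciprocal_array is hypothetical. It calls vvrecf (element-wise reciprocal) from vForce.h, chosen here only because its portable fallback is a one-line loop; vvsinf and the other single precision vForce routines take the same (output, input, pointer to count) arguments. The #else branch is a plain C stand-in for systems without Accelerate, not a description of how vForce works internally.

```c
#include <assert.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>  /* pulls in vForce.h */
#endif

/* Hypothetical wrapper: out[i] = 1 / in[i] for a whole array in one call. */
void reciprocal_array(float *out, const float *in, int n)
{
    assert(n >= 0);
#ifdef __APPLE__
    vvrecf(out, in, &n);            /* vForce takes the element count by pointer */
#else
    for (int i = 0; i < n; ++i)     /* plain loop standing in for vForce */
        out[i] = 1.0f / in[i];
#endif
}
```

Handing the library the whole array, rather than calling a scalar function inside your own loop, is what gives it the freedom to keep many calculations in flight at once.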

vDSP

The vDSP Library provides signal processing functions for applications such as speech, sound, audio, and video processing, diagnostic medical imaging, radar signal processing, seismic analysis, and scientific data processing. The vDSP functions operate on real and complex data types. The functions include data type conversions, fast Fourier transforms (FFTs), and vector-to-vector and vector-to-scalar operations.

The vDSP functions have been implemented in two ways: as vectorized code (for single precision only), which uses the vector unit on the PowerPC G4/G5 microprocessor, and as scalar code, which runs on Macintosh models that have a G3 microprocessor. It is worth noting that vDSP's FFTs are among the fastest implementations of the Discrete Fourier Transform available anywhere.

The vDSP Library itself is included as part of vecLib in Mac OS X. The header file, vDSP.h, is provided for Macintosh software developers; it defines nonstandard data types used by the vDSP functions and symbols accepted as flag arguments to vDSP functions.

vDSP functions are available in single and double precision. Note that only the single precision is vectorized due to the underlying instruction set architecture of the vector engine on board G4 processors.

New in MacOS X.4 for vDSP is a series of basic signal processing operators such as fixed <-> float <-> double converters, simple filters like Blackman/Hanning windows, resampling functions, maximum and minimum finders, basic statistics operations, polar/cartesian converters, Pythagorean operators, linear interpolators, real and complex matrix multiplication and a long list of basic array operators.
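To give a feel for the basic array operators and the maximum finders, here is a small sketch that adds two arrays and then finds the largest element of the result. The wrapper add_and_find_peak is hypothetical. The vDSP_vadd and vDSP_maxv names and argument orders are those declared in current versions of vDSP.h, with each array followed by its stride. The #else branch is a plain C stand-in for building without Accelerate.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>  /* pulls in vDSP.h */
#endif

/* Hypothetical wrapper: sum[i] = a[i] + b[i]; returns the largest sum[i]. */
float add_and_find_peak(const float *a, const float *b, float *sum, size_t n)
{
    float peak;
    assert(n > 0);
#ifdef __APPLE__
    vDSP_vadd(a, 1, b, 1, sum, 1, n);   /* vector-to-vector: stride 1 everywhere */
    vDSP_maxv(sum, 1, &peak, n);        /* vector-to-scalar: maximum element */
#else
    peak = a[0] + b[0];
    for (size_t i = 0; i < n; ++i) {    /* plain loop standing in for vDSP */
        sum[i] = a[i] + b[i];
        if (sum[i] > peak)
            peak = sum[i];
    }
#endif
    return peak;
}
```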

BLAS/LAPACK

Since MacOS X.2, Apple has been shipping a tuned version of the industry standard BLAS and LAPACK libraries, for basic matrix computation and linear algebra. These may be called from both FORTRAN and C. For more information on BLAS and LAPACK, please see the NetLib home page: <http://www.netlib.org/blas/faq.html> <http://www.netlib.org/lapack/> There are also entire books available about just LAPACK, such as the LAPACK Users' Guide, ISBN: 0898713455, which may serve as an excellent primary reference. Key pieces of these libraries have been retuned for G5 and Intel.
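For readers who have not called BLAS from C before, the sketch below multiplies two row-major matrices through the standard CBLAS interface. The wrapper matmul is hypothetical; cblas_sgemm and its arguments are the standard CBLAS ones. The #else branch is a naive triple loop included only so the sketch builds without Accelerate.

```c
#include <assert.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>  /* pulls in cblas.h */
#endif

/* Hypothetical wrapper: C = A * B with row-major storage.
   A is m x k, B is k x n, C is m x n. */
void matmul(const float *A, const float *B, float *C, int m, int n, int k)
{
    assert(m > 0 && n > 0 && k > 0);
#ifdef __APPLE__
    /* alpha = 1 and beta = 0, so C is overwritten with the product. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, k, B, n, 0.0f, C, n);
#else
    for (int i = 0; i < m; ++i) {       /* naive loop standing in for BLAS */
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
    }
#endif
}
```

The leading dimensions passed to cblas_sgemm (k, n and n) are simply the row lengths of A, B and C, because the matrices are stored row by row with no padding.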

vImage

Introduced in MacOS X.3, vImage is a suite of image processing APIs. vImage provides fast vectorized filters for Convolution (blur, sharpen, emboss, etc. depending on kernel used), Alpha compositing, Geometry operators (Dilate, Erode, Min, Max), Histogram operations and image format conversions. The framework supports both 8-bit / channel and floating point image formats in both planar (single channel) and ARGB packed pixel layouts. Due to the basic nature of these filters, the functions herein can typically be used on other formats, either directly or with some prior conversion. The framework is designed for real time performance from the ground up. Where they are needed you can pass in temporary buffers to avoid blocking calls to malloc and to reduce or eliminate zero-fill faults. The framework's own image descriptors are unencapsulated so that the framework can operate directly on your data in place without a lot of unnecessary copying.
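Because the image descriptor is an ordinary struct, describing a region of an existing image is a matter of pointer arithmetic rather than copying. The sketch below shows the idea for an 8-bit planar image. The helper planar8_region is hypothetical. The struct in the #else branch is a stand-in that mirrors the fields of vImage_Buffer (data, height, width, rowBytes) so the sketch builds without Accelerate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>  /* defines vImage_Buffer */
#else
/* Stand-in mirroring the fields of vImage_Buffer (an assumption for
   building elsewhere, not Accelerate code). */
typedef unsigned long vImagePixelCount;
typedef struct {
    void             *data;      /* pointer to the top-left pixel */
    vImagePixelCount  height;    /* number of rows */
    vImagePixelCount  width;     /* pixels per row */
    size_t            rowBytes;  /* bytes from one row to the next */
} vImage_Buffer;
#endif

/* Hypothetical helper: describe a sub-rectangle of an 8-bit planar image
   in place, without copying any pixels. */
vImage_Buffer planar8_region(uint8_t *pixels, size_t rowBytes,
                             size_t x, size_t y, size_t width, size_t height)
{
    vImage_Buffer roi;
    assert(x + width <= rowBytes);       /* region must fit inside a row */
    roi.data     = pixels + y * rowBytes + x;
    roi.height   = height;
    roi.width    = width;
    roi.rowBytes = rowBytes;             /* keep the parent image's stride */
    return roi;
}
```

The returned descriptor can be handed to vImage functions as a source or destination, and they will touch only the pixels inside that rectangle.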

For MacOS X.4, vImage was extended to provide data type converters for many more pixel formats. In addition, we introduced several new APIs intended to be used for color correction, including three-channel interleaved formats, 16-bit OpenEXR floats and 16-bit integers. There are fast matrix multiplication functions, polynomial and rational approximation evaluators, and high performance full and half precision gamma functions for image processing. We also added several new Convolution varieties, including new algorithms for box and tent convolves, and multiple kernel convolves for interleaved data formats. The performance of existing Convolution functions has been greatly improved through new algorithms.

Mac OS X v10.3

Overview

This release note describes additions to Accelerate.framework and vecLib.framework on MacOS X.3, Panther, to enhance the AltiVec programming environment and to provide a wide diversity of hardware accelerated APIs that may be used for high performance computation outside of AltiVec code. The Accelerate.framework is new in MacOS X.3. It is an umbrella framework that contains what used to be vecLib.framework, and adds (new to MacOS X.3) a diversity of image processing functions within the new vImage.framework subframework. New code targeted to MacOS X.3 and later should link to Accelerate.framework instead of vecLib.framework.

AltiVec (a.k.a Velocity Engine, VMX)

The PowerPC vector instruction set architecture contains a separate SIMD style execution unit with inherently high data parallelism. This high degree of parallelism is enhanced with additional parallelism through superscalar dispatch to multiple execution units and execution unit pipelines. All vector instructions are designed to be easily pipelined with pipeline latencies no greater than the scalar double precision floating-point multiply-add fused class of instructions. There are no operating mode switches which preclude fine grain interleaving of instructions with the existing floating-point and integer instructions. Parallelism with the integer and floating-point instructions is simplified by the facts that the vector unit never generates an exception and has few shared resources or communication paths that require it to be tightly synchronized with the other units.

Accelerate.framework

Accelerate.framework is a collection of facilities covering digital signal processing (vDSP), matrix computations (BLAS), linear algebra (LAPACK), a vector math library (vMathLib), a vector integer library (vBasicOps) and a vector large integer library (vBigNum), and, new in MacOS X.3, a high performance image processing framework (vImage). It supersedes (and includes) the former vecLib.framework and should be used instead of vecLib.framework for applications targeting MacOS X.3 and later. It is intended to be a one-stop shopping place for high performance code of all kinds.

Accelerate.framework ----------  vecLib.framework ----------(libvMisc) ----- vBigNum
    |                   |     |        \             |       \
    +---vImage.framework      vDSP   BLAS     LAPACK     vMathLib    vBasicOps

        All these APIs may be accessed through Accelerate.framework

The APIs in Accelerate.framework are designed to transparently run on the fastest hardware available on the system. If the machine has an AltiVec unit, that will be used. Otherwise, the scalar units will be used on G3 class computers. The Accelerate.framework provides two important advantages to software developers:

Most Accelerate.framework APIs do not require you to write vector code in order to use them. The majority take simple pointers to arrays and array sizes and can be called from vanilla C code or in many cases even FORTRAN. APIs in vDSP, BLAS, LAPACK and vImage are of this sort. In addition, Accelerate.framework provides APIs to help developers who choose to write their own vector code. These may be found in vMathLib, vBigNum and vBasicOps.
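The "pointers to arrays and array sizes" style can be seen in a call as small as a dot product. The sketch below uses the standard CBLAS routine cblas_sdot from ordinary scalar C. The wrapper dot_product is hypothetical, and the #else branch is a plain loop included only so the sketch builds without Accelerate.

```c
#include <assert.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>  /* pulls in cblas.h */
#endif

/* Hypothetical wrapper: returns x[0]*y[0] + ... + x[n-1]*y[n-1]. */
float dot_product(const float *x, const float *y, int n)
{
    assert(n >= 0);
#ifdef __APPLE__
    return cblas_sdot(n, x, 1, y, 1);   /* array pointers, strides and a count */
#else
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)         /* plain loop standing in for BLAS */
        acc += x[i] * y[i];
    return acc;
#endif
}
```

Nothing in the caller mentions vector types; whether a vector unit is used is decided inside the library.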

vDSP

BLAS/LAPACK

Since MacOS X.2, Apple has been shipping a tuned version of the industry standard BLAS and LAPACK libraries, for basic matrix computation and linear algebra. These may be called from both FORTRAN and C. For more information on BLAS and LAPACK, please see the NetLib home page: <http://www.netlib.org/blas/faq.html> <http://www.netlib.org/lapack/> There are also entire books available about just LAPACK, such as the LAPACK Users' Guide, ISBN: 0898713455, which may serve as an excellent primary reference.

vImage

New in MacOS X.3 is a suite of image processing APIs. If you are familiar with image processing APIs in Intel Integrated Performance Primitives™, SGI's ImageVision™ library, or Adobe Photoshop™ filters, you will find similar functions here. vImage provides fast vectorized filters for Convolution (blur, sharpen, emboss, etc. depending on kernel used), Alpha compositing, Geometry operators (Dilate, Erode, Min, Max), Histogram operations and image format conversions. The framework supports both 8-bit / channel and floating point image formats in both planar (single channel) and ARGB packed pixel layouts. Due to the basic nature of these filters, the functions herein can typically be used on other formats, either directly or with some prior conversion. The framework is designed for real time performance from the ground up. Where they are needed you can pass in temporary buffers to avoid blocking calls to malloc and to reduce or eliminate zero-fill faults. The framework's own image descriptors are unencapsulated so that the framework can operate directly on your data in place without a lot of unnecessary copying.

vMathLib

vMathLib provides vector math library support to AltiVec programmers. Standard math library operators such as sine, cosine, power, log, etc. are made available in vector form at vector speeds. Please see Accelerate.framework/vfp.h for a complete list.

vBasicOps / vBigNum

These libraries provide basic vector integer operations like addition, subtraction, division, multiplication, shifts and rotations for vector integers from 8 bits / element all the way up to 1024-bit integers. Please see Accelerate.framework/vBasicOps.h and Accelerate.framework/vBigNum.h for a complete list.
