
Optimizing Your Memory Allocations

Memory is an important resource for your application, so you should take the time to examine your application’s memory allocation patterns and make changes as necessary.

You can gather a history of your allocations using the Sampler program or using the malloc_history command-line tool. For more information on analyzing your memory usage, see “Examining Memory Allocation Patterns.”

Contents:

Memory Allocation in Mac OS X
Tips for Allocating Memory
Copying Memory


Memory Allocation in Mac OS X

Mac OS X implements a highly tuned, thread-safe allocation library, providing standard implementations of the malloc, calloc, realloc, and free routines, among others. If you are allocating memory using older routines such as NewPtr or NewHandle, you should change your code to use malloc instead. The end result is the same, since most legacy routines are now wrappers for malloc anyway.

If you are using a custom malloc implementation, you should consider moving to the system-supplied malloc routines. The Mac OS X malloc implementation is highly optimized and fully supports the Apple-provided memory analysis tools. Moving to Apple’s implementation not only gains you the ability to analyze your memory, it lets you remove your custom code from your executable, thus reducing your application footprint.

The following sections provide some details on how the Mac OS X allocation library handles large and small allocations. This information can help you identify the costs associated with each type of allocation. Note that although the following sections talk about the behaviors of the malloc routine, those behaviors also apply to routines such as calloc and realloc.

Allocating Small Memory Blocks

For allocations of less than a few virtual memory pages, malloc suballocates the requested amount from a list (or “pool”) of free blocks of increasing size. Any small blocks you deallocate using the free routine are added back to the pool and reused on a “best fit” basis. The memory pool is itself composed of several virtual memory pages and is allocated using the vm_allocate routine.

The granularity of any block returned by malloc is 16 bytes. Any block you allocate is at least 16 bytes in size, and its size is always a multiple of 16 bytes. Thus, if you request 4 bytes, malloc returns a block of 16 bytes. If you request 24 bytes, malloc returns a block of 32 bytes.

Note: By their nature, allocations smaller than a single virtual memory page in size cannot be page aligned.

Allocating Large Memory Blocks

For allocations greater than a few virtual memory pages, malloc uses the vm_allocate routine to obtain a block of the requested size. The vm_allocate routine assigns an address range to the new block in the virtual memory space of the current process but does not allocate any physical memory. Instead, the system pages in physical memory for the allocated block as it is used.

The granularity of large memory blocks is 4096 bytes, the size of a virtual memory page. If you are allocating a large memory buffer, you should consider making it a multiple of this size.

Note: Large memory allocations are guaranteed to be page-aligned.
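
If you want to round an arbitrary buffer size up to a whole number of pages, the following is a minimal sketch; the helper name is illustrative, and the vm_page_size global is provided by Mach.

#include <mach/mach.h>
 
// Round a requested size up to a whole number of VM pages.
// (Illustrative helper; vm_page_size is a Mach global and is
// always a power of two, which this bit trick relies on.)
static size_t RoundToPageMultiple(size_t size)
{
    return (size + vm_page_size - 1) & ~((size_t)vm_page_size - 1);
}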

For large allocations, you may find that it makes sense to allocate virtual memory using vm_allocate directly. The example in Listing 1 shows how to use the vm_allocate function.

Listing 1  Allocating memory with vm_allocate

#include <mach/mach.h>
#include <AssertMacros.h>   // for the check() debug macro
 
void* AllocateVirtualMemory(size_t size)
{
    char*           data;
    kern_return_t   err;
 
    // In debug builds, check that we have
    // correct VM page alignment
    check(size != 0);
    check((size % 4096) == 0);
 
    // Allocate directly from VM
    err = vm_allocate(  (vm_map_t) mach_task_self(),
                        (vm_address_t*) &data,
                        size,
                        VM_FLAGS_ANYWHERE);
 
    // Check errors
    check(err == KERN_SUCCESS);
    if(err != KERN_SUCCESS)
    {
        data = NULL;
    }
 
    return data;
}
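
Memory obtained from vm_allocate must eventually be returned with vm_deallocate rather than free. A minimal counterpart sketch follows; the function name is illustrative.

void DeallocateVirtualMemory(void* data, size_t size)
{
    // Blocks obtained from vm_allocate must be returned with
    // vm_deallocate, not free; pass the same size used to allocate.
    vm_deallocate( (vm_map_t) mach_task_self(),
                   (vm_address_t) data,
                   size);
}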

Allocating Memory in Batches

If your code allocates multiple, identically sized memory blocks, you can use the malloc_zone_batch_malloc function to allocate those blocks all at once. This function offers better performance than an equivalent series of calls to malloc. Performance is best when the individual block size is relatively small (less than 4K). The function does its best to allocate all of the requested memory but may return fewer blocks than requested. When using this function, check the return value carefully to see how many blocks were actually allocated.

Batch allocation of memory blocks is supported in Mac OS X version 10.3 and later. For information, see the /usr/include/malloc/malloc.h header file.
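
The following is a hedged sketch of batch allocation from the default zone; the block size and count are arbitrary, and the wrapper function name is illustrative.

#include <malloc/malloc.h>
 
void UseBatchAllocation(void)
{
    void*    blocks[100];
    unsigned count;
 
    // Request 100 blocks of 32 bytes each from the default zone.
    // The return value is the number of blocks actually allocated,
    // which may be fewer than requested.
    count = malloc_zone_batch_malloc(malloc_default_zone(), 32, blocks, 100);
 
    // ... use blocks[0] through blocks[count - 1] ...
 
    // The companion function frees the blocks as a batch as well.
    malloc_zone_batch_free(malloc_default_zone(), blocks, count);
}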

About Memory Zones

A zone is a variable-size range of virtual memory from which malloc allocates blocks. All allocations made using the malloc function occur within the standard malloc zone, which is created when malloc is first called by your application. You can create additional malloc zones and allocate memory in a specific zone.

Note: In terms of memory allocation using the malloc routines, the term zone is synonymous with the terms heap, pool, and arena.

Zones have the advantage of allowing blocks with similar access patterns or lifetimes to be placed together, theoretically minimizing wasted space or paging activity. You can also allocate many objects in a zone and then destroy the zone to free them all at once. For most developers, however, zones fail to deliver a performance advantage; avoid them unless you need to track a set of memory blocks separately from other allocations, need to free many memory blocks quickly, or have measured a specific case where zones help.

For information on how to use multiple zones in an application, see “Using Multiple Malloc Zones.”

Tips for Allocating Memory

When it comes time to allocate memory for your program, there are several factors to consider. The following sections provide guidelines on when and how to allocate memory.

Deferring Memory Allocations

Every memory allocation has a performance cost. That cost includes the time it takes to allocate the memory and the space the memory occupies. If you do not need a particular block of memory right away, you should consider deferring its allocation until the first time you actually need it. Once the block is allocated, you can use it and then either free it or cache it for later use.

Applications often allocate memory during initialization and then use that memory much later, or sometimes not at all during a given session. Not only does this practice force the application to pay the allocation cost up front, it often does so needlessly. You can easily improve on this approach by deferring the allocation to the first time the memory is needed.

For most operations, you can easily arrange your code to use a block of memory right after you allocate it. But if your application uses global variables, you need another way to ensure the memory exists when you need it, but not before. To accomplish this with a minimum of code modification, replace direct accesses to the global pointer with calls to an accessor function that allocates the memory the first time it is called.

Listing 2 gives an example of this technique. Code modules that want to access the global buffer call this function to get the pointer.

Listing 2  Lazy allocation of memory through an accessor

MyGlobalInfo* GetGlobalBuffer()
{
    static MyGlobalInfo* sGlobalBuffer = NULL;
    if ( sGlobalBuffer == NULL )
    {
        sGlobalBuffer = malloc( sizeof( MyGlobalInfo ) );
    }
    return sGlobalBuffer;
}

Note: This code is not safe in the presence of multiple threads. More than one thread could call this function simultaneously, causing the memory to be allocated more than once. To make it threadsafe, add a semaphore lock around the if statement and any required initialization code.
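
For example, one thread-safe variant uses pthread_once to guarantee a single allocation. This is a minimal sketch, assuming the same MyGlobalInfo type as Listing 2.

#include <pthread.h>
#include <stdlib.h>
 
static MyGlobalInfo*  sGlobalBuffer = NULL;
static pthread_once_t sGlobalBufferOnce = PTHREAD_ONCE_INIT;
 
static void AllocateGlobalBuffer( void )
{
    // Runs exactly once, no matter how many threads race here.
    sGlobalBuffer = malloc( sizeof( MyGlobalInfo ) );
}
 
MyGlobalInfo* GetGlobalBuffer()
{
    pthread_once( &sGlobalBufferOnce, AllocateGlobalBuffer );
    return sGlobalBuffer;
}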

Initializing Memory

Memory allocated using malloc is not guaranteed to be initialized with zeroes. Instead of using memset to initialize the memory, a better choice is to use the calloc routine to allocate the memory in the first place.

When you call memset right after malloc, the virtual memory system must map the corresponding pages into memory in order to zero-initialize them. This operation can be very expensive and wasteful, especially if you do not use the pages right away.

The calloc routine reserves the required virtual address space for the memory but waits until the memory is actually used before initializing it. This approach alleviates the need to map the pages into memory right away. It also lets the system initialize pages as they’re used, as opposed to all at once.
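
A minimal sketch of the difference follows; the MyRecord type and count parameter are illustrative.

#include <stdlib.h>
 
void AllocateRecords( size_t count )
{
    // Avoid this pattern; memset touches every page immediately:
    //     MyRecord* records = malloc( count * sizeof( MyRecord ) );
    //     memset( records, 0, count * sizeof( MyRecord ) );
 
    // Prefer calloc, which reserves the address range but lets the
    // system provide zero-filled pages lazily as they are first used:
    MyRecord* records = calloc( count, sizeof( MyRecord ) );
 
    // ... use records, then free( records ) when done ...
}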

Using Multiple Malloc Zones

All memory blocks are contained within a malloc zone (also referred to as a malloc heap). All allocations made by the malloc function occur within the default malloc zone of the current process, which is created when malloc is first called. Although it is generally not recommended, you can create additional zones if measurements show there to be potential performance gains. For example, if the effect of releasing a large number of temporary (and isolated) objects is slowing down your application, you could allocate them in a zone instead and simply deallocate the zone.

Basic support for zones is defined in /usr/include/malloc/malloc.h. Use the malloc_create_zone function to create a custom malloc zone or the malloc_default_zone function to get the default zone for your application. To allocate memory in a particular zone, use the malloc_zone_malloc, malloc_zone_calloc, malloc_zone_valloc, or malloc_zone_realloc functions. To release all the memory in a custom zone, call malloc_destroy_zone.


Warning: You should never deallocate the default zone for your application.
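
A minimal sketch of creating, using, and destroying a custom zone follows; the sizes and function name are arbitrary.

#include <malloc/malloc.h>
 
void UseTemporaryZone(void)
{
    // Create a custom zone. The starting size is a hint,
    // and the flags argument is currently unused (pass 0).
    malloc_zone_t* zone = malloc_create_zone( 0, 0 );
 
    // Allocate temporary blocks from the zone.
    void* temp = malloc_zone_malloc( zone, 1024 );
 
    // ... use the memory ...
 
    // Destroying the zone frees every block allocated from it.
    // Never do this to the default zone.
    malloc_destroy_zone( zone );
}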

If you are a Cocoa developer, you can also use the NSCreateZone function to create a custom malloc zone and the NSDefaultMallocZone function to get the default zone for your application. To create new objects in a custom zone, use the allocWithZone: class method, which is available to all subclasses of NSObject. If your class does not descend from NSObject, use the NSAllocateObject function to allocate the memory for your new instances. For more information, see the function descriptions in Foundation Framework Reference.

If you are creating objects (or allocating memory blocks) in a custom malloc zone, you can simply free the entire zone when you are done with it, instead of releasing the zone-allocated objects or memory blocks individually. When doing so, be sure your application data structures do not hold references to the memory in the custom zone. Attempting to access memory in a deallocated zone will cause a memory fault and crash your application.

Cache Temporary Buffers

If you have a frequently used function that allocates a large temporary buffer for some calculations, you might want to consider alternative ways to allocate that buffer. Instead of allocating a new block of memory each time it is called, your function could allocate a buffer once and reuse that buffer during subsequent invocations. If your function needs a variable amount of buffer space, you can grow the buffer as needed. For multithreaded applications, you can attach the buffer pointer to your thread’s context. For single-threaded applications, you can store the pointer in a global variable.
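
A minimal sketch of the single-threaded case, using a function-static pointer and growing the cached buffer with realloc; all names are illustrative.

#include <stdlib.h>
 
void ProcessData( size_t requiredSize )
{
    static void*  sScratchBuffer = NULL;
    static size_t sScratchSize   = 0;
 
    // Grow the cached buffer only when this call needs more
    // space than any previous invocation did.
    if ( requiredSize > sScratchSize )
    {
        void* newBuffer = realloc( sScratchBuffer, requiredSize );
        if ( newBuffer == NULL )
            return;    // allocation failed; handle appropriately
        sScratchBuffer = newBuffer;
        sScratchSize   = requiredSize;
    }
 
    // ... perform calculations using sScratchBuffer ...
}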

Caching buffers eliminates much of the overhead for functions that regularly allocate and free large blocks of memory. However, this technique is only appropriate for functions that are called frequently. Also, you should be careful not to cache too many large buffers. Caching buffers does add to the memory footprint of your application. You should be sure to gather metrics for your program with and without the caches to see which yields better performance.

Release Your Memory

Finally, keep in mind the importance of releasing (via the free routine) all memory that you have allocated with malloc, calloc, or realloc. Neglecting to release memory causes memory leaks, which have a direct impact on performance. To help track down memory leaks, use the MallocDebug application or the leaks command-line tool. Both of these tools are described in “Examining Memory Allocation Patterns.”

Using Handles in Carbon

If you have existing code from Mac OS 9 that you are porting to Mac OS X, you can achieve some performance gains by simplifying your handle-related code. The benefit offered by handles in Mac OS 9 is no longer relevant in applications built for Mac OS X. In particular, there is no need to compact the memory blocks referenced by handles. As a result, your handles never move and there is no need to lock them when you want to access their contents.

If you have code that makes calls to HLock, HUnlock, HSetState, or HGetState, you can either conditionally compile that code out for Mac OS X or remove it entirely. The only exception to this rule is the case where your code calls the SetHandleSize function, which can potentially move a handle if more space is required. If your code needs to access a handle that might be resized at some point, you should lock the handle first.

Copying Memory

There are two main approaches to copying memory in Mac OS X: direct and delayed. For most situations, the direct approach offers the best overall performance. However, there are times when a delayed-copy operation has its benefits. The goal of the following sections is to introduce you to the different approaches for copying memory and the situations in which you might use them.

Copying Memory Directly

The direct copying of memory involves using a routine such as memcpy or memmove to copy bytes from one block to another. Both the source and destination blocks must be resident in memory at the time of the copy. These routines are especially suited to situations where the amount of data is small or where you plan to use the source or destination data right away.

If you do not plan to use the source or destination data for some time, performing a direct copy can decrease performance significantly for large memory blocks. Copying the memory directly increases the size of your application’s working set. Whenever you increase your application’s working set, you increase the chances of paging to disk. If you have two direct copies of a large memory block in your working set, you might end up paging them both to disk. When you later access either the source or destination, you would then need to load that data back from disk, which is much more expensive than using vm_copy to perform a delayed copy operation.

Note: If the source and destination blocks overlap, you should prefer the use of memmove over memcpy. Both implementations handle overlapping blocks correctly in Mac OS X, but the implementation of memcpy is not guaranteed to do so.

Delaying Memory Copy Operations

If you intend to copy many pages worth of memory, but don’t intend to use either the source or destination pages immediately, then you may want to use the vm_copy routine. Unlike memmove or memcpy, vm_copy does not touch any real memory. It modifies the virtual memory map to indicate that the destination address range is a copy-on-write version of the source address range.

The vm_copy routine is more efficient than memcpy only in specific situations: namely, when your code does not access either the source or destination memory for a relatively long period after the copy operation. The reason vm_copy works well for delayed usage is the way the kernel handles the copy-on-write case. To perform the copy operation, the kernel must remove all references to the source pages from the virtual memory system. The next time a process accesses data on one of those source pages, a soft fault occurs, and the kernel maps the page back into the process space as a copy-on-write page. The process of handling a single soft fault is almost as expensive as copying the data directly.
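
A hedged sketch of wrapping vm_copy follows; the wrapper name is illustrative, and the source and destination addresses, as well as the size, should be page-aligned.

#include <mach/mach.h>
 
kern_return_t DelayedCopy(void* src, void* dst, size_t size)
{
    // Marks the destination range as a copy-on-write reference to
    // the source range instead of touching physical memory.
    // Addresses and size should be page-aligned.
    return vm_copy( (vm_map_t) mach_task_self(),
                    (vm_address_t) src,
                    (vm_size_t) size,
                    (vm_address_t) dst );
}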

Copying Small Amounts of Data

If you need to copy small blocks of non-overlapping data, you should prefer memcpy over any other routine. For small blocks of memory, the GCC compiler can replace calls to this routine with inline instructions that copy the data by value. The compiler may not optimize other routines, such as memmove or BlockMoveData, in this way.

Copying Data to Video RAM

When copying data into VRAM, use the BlockMoveDataUncached function instead of functions such as bcopy. The bcopy routine uses cache-manipulation instructions that may cause exception errors. The kernel must fix these errors in order to continue, which slows performance tremendously.





© 2003, 2006 Apple Computer, Inc. All Rights Reserved. (Last updated: 2006-06-28)

