Memory is an important resource for your application, so take the time to examine your application's memory allocation patterns and make changes as necessary. You can gather a history of your allocations using the Sampler program or the malloc_history command-line tool. For more information on analyzing your memory usage, see "Examining Memory Allocation Patterns."
Memory Allocation in Mac OS X
Mac OS X implements a highly tuned, threadsafe allocation library, providing standard implementations of the malloc, calloc, realloc, and free routines, among others. If you are allocating memory using older routines such as NewPtr or NewHandle, you should change your code to use malloc instead. The end result is the same, since most legacy routines are now wrappers for malloc anyway.
If you are using a custom malloc implementation, you should consider moving to the system-supplied malloc routines. The Mac OS X malloc implementation is highly optimized and fully supports the Apple-provided memory analysis tools. Moving to Apple's implementation not only gains you the ability to analyze your memory, it lets you remove your custom code from your executable, thus reducing your application's footprint.
The following sections provide some details on how the Mac OS X allocation library handles large and small allocations. This information can help you identify the costs associated with each type of allocation. Note that although the following sections talk about the behaviors of the malloc routine, those behaviors also apply to routines such as calloc and realloc.
For allocations of less than a few virtual memory pages, malloc suballocates the requested amount from a list (or "pool") of free blocks of increasing size. Any small blocks you deallocate using the free routine are added back to the pool and reused on a "best fit" basis. The memory pool itself comprises several virtual memory pages and is allocated using the vm_allocate routine.
The granularity of any block returned by malloc is 16 bytes. Any block you allocate is at least 16 bytes in size, and its size is always a multiple of 16. Thus, if you request 4 bytes, malloc returns a block of 16 bytes. If you request 24 bytes, malloc returns a block of 32 bytes.
Note: By their nature, allocations smaller than a single virtual memory page in size cannot be page aligned.
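You can verify this rounding with the malloc_size function, declared in /usr/include/malloc/malloc.h, which reports the usable size of an allocated block. A minimal sketch:
    #include <stdio.h>
    #include <stdlib.h>
    #include <malloc/malloc.h>

    int main(void)
    {
        void* small = malloc(4);    // request rounds up to 16 bytes
        void* medium = malloc(24);  // request rounds up to 32 bytes

        printf("4-byte request:  %zu usable bytes\n", malloc_size(small));
        printf("24-byte request: %zu usable bytes\n", malloc_size(medium));

        free(small);
        free(medium);
        return 0;
    }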
For allocations greater than a few virtual memory pages, malloc uses the vm_allocate routine to obtain a block of the requested size. The vm_allocate routine assigns an address range to the new block in the virtual memory space of the current process but does not allocate any physical memory. Instead, the malloc routine pages in the memory for the allocated block as it is used.
The granularity of large memory blocks is 4096 bytes, the size of a virtual memory page. If you are allocating a large memory buffer, you should consider making it a multiple of this size.
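If you want to round a request up to a whole number of pages, one approach is sketched below. It assumes the vm_page_size global from the Mach headers, which is preferable to hard-coding 4096 because the page size can differ across hardware:
    #include <mach/mach.h>

    // Round a byte count up to the next multiple of the VM page
    // size. Relies on the page size being a power of two.
    size_t RoundToPageSize(size_t size)
    {
        return (size + vm_page_size - 1) & ~(size_t)(vm_page_size - 1);
    }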
For large allocations, you may find that it makes sense to allocate virtual memory using vm_allocate directly. The example in Listing 1 shows how to use the vm_allocate function.
Listing 1 Allocating memory with vm_allocate
    #include <stddef.h>        // size_t
    #include <mach/mach.h>     // vm_allocate, mach_task_self
    #include <AssertMacros.h>  // check() debug-assertion macro

    void* AllocateVirtualMemory(size_t size)
    {
        char*          data;
        kern_return_t  err;

        // In debug builds, check that we have
        // correct VM page alignment
        check(size != 0);
        check((size % 4096) == 0);

        // Allocate directly from VM
        err = vm_allocate( (vm_map_t) mach_task_self(),
                           (vm_address_t*) &data,
                           size,
                           VM_FLAGS_ANYWHERE);

        // Check errors
        check(err == KERN_SUCCESS);
        if(err != KERN_SUCCESS)
        {
            data = NULL;
        }

        return data;
    }
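Memory obtained from vm_allocate must eventually be returned with vm_deallocate rather than free. A matching release routine for the listing above might look like this sketch:
    #include <mach/mach.h>

    // Release a block previously obtained from vm_allocate.
    // size must match the size used at allocation time.
    void FreeVirtualMemory(void* data, size_t size)
    {
        (void) vm_deallocate( (vm_map_t) mach_task_self(),
                              (vm_address_t) data,
                              size );
    }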
If your code allocates multiple, identically sized memory blocks, you can use the malloc_zone_batch_malloc function to allocate those blocks all at once. This function offers better performance than an equivalent series of calls to malloc. Performance is best when the individual block size is relatively small (less than 4K). The function does its best to allocate all of the requested memory but may return fewer blocks than requested. When using this function, check the return value carefully to see how many blocks were actually allocated.
Batch allocation of memory blocks is supported in Mac OS X version 10.3 and later. For information, see the /usr/include/malloc/malloc.h header file.
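A minimal sketch of batch allocation against the default zone (the block size and count are illustrative):
    #include <malloc/malloc.h>

    void AllocateInBatch(void)
    {
        void* blocks[100];

        // Request 100 blocks of 32 bytes each. The return value is
        // the number of blocks actually allocated, which may be
        // fewer than requested.
        unsigned count = malloc_zone_batch_malloc(malloc_default_zone(),
                                                  32, blocks, 100);

        // ... use blocks[0] through blocks[count - 1] ...

        // Free the whole batch in one call.
        malloc_zone_batch_free(malloc_default_zone(), blocks, count);
    }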
A zone is a variable-size range of virtual memory from which malloc allocates blocks. All allocations made using the malloc function occur within the standard malloc zone, which is created when malloc is first called by your application. You can create additional malloc zones and allocate memory in a specific zone.
Note: The term zone is synonymous with the terms heap, pool, and arena in discussions of memory allocation using the malloc routines.
Zones have the advantage of allowing blocks with similar access patterns or lifetimes to be placed together, theoretically minimizing wasted space or paging activity. You can allocate many objects in a zone and then destroy the zone to free them all. For most developers, however, zones deliver no performance advantage. Avoid them unless you need to track a set of memory blocks separately from other allocations, you need to free many memory blocks quickly, or you have measured a specific case where zones help.
For information on how to use multiple zones in an application, see "Using Multiple Malloc Zones."
Tips for Allocating Memory
When it comes time to allocate memory for your program, there are several other factors to consider. The following sections provide guidelines on when and how to allocate memory.
Every memory allocation has a performance cost. That cost is measured by the time it takes to allocate the memory and the space occupied by the memory. If you do not need a particular block of memory right away, you should consider deferring its allocation until the first time you actually need it. Once allocated, you can then use it and delete it or cache it for later use.
Applications often allocate memory during initialization and then use it much later, or sometimes not at all during a given session. Allocating memory up front in this way imposes an immediate cost, often needlessly. You can easily improve on this costly approach by deferring the allocation to the first time the memory is needed.
For most operations, you can easily arrange your code to use a block of memory right after you allocate it. But if your application uses global variables, you need another way to ensure the memory is there when you need it, but not before. To accomplish this with a minimum of code modification, do the following:
Turn any global variables into static variables so that they are inaccessible to other code modules.
Create a public accessor function to access the static variable and allocate and initialize the buffer for it upon the first invocation.
Listing 2 gives an example of this technique. Code modules that want to access the global buffer call the function to access the pointer.
Listing 2 Lazy allocation of memory through an accessor
    #include <stdlib.h>    // malloc

    // MyGlobalInfo is your application's own data type.
    MyGlobalInfo* GetGlobalBuffer()
    {
        static MyGlobalInfo* sGlobalBuffer = NULL;

        if ( sGlobalBuffer == NULL )
        {
            sGlobalBuffer = malloc( sizeof( MyGlobalInfo ) );
        }

        return sGlobalBuffer;
    }
Note: This code is not safe in the presence of multiple threads. More than one thread could call this function simultaneously, causing the memory to be allocated more than once. To make it threadsafe, add a semaphore lock around the if statement and any required initialization code.
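One threadsafe variant of the accessor, sketched here using the POSIX pthread_once facility instead of a semaphore (MyGlobalInfo is again your application's own type):
    #include <pthread.h>
    #include <stdlib.h>

    static MyGlobalInfo* sGlobalBuffer = NULL;
    static pthread_once_t sBufferOnce = PTHREAD_ONCE_INIT;

    static void AllocateGlobalBuffer(void)
    {
        // pthread_once guarantees this runs exactly once, even
        // if several threads race into the accessor.
        sGlobalBuffer = malloc( sizeof( MyGlobalInfo ) );
    }

    MyGlobalInfo* GetGlobalBuffer()
    {
        pthread_once( &sBufferOnce, AllocateGlobalBuffer );
        return sGlobalBuffer;
    }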
Memory allocated using malloc is not guaranteed to be initialized with zeroes. Instead of using memset to initialize the memory, a better choice is to use the calloc routine to allocate the memory in the first place.
When you call memset right after malloc, the virtual memory system must map the corresponding pages into memory in order to zero-initialize them. This operation can be very expensive and wasteful, especially if you do not use the pages right away.
The calloc routine reserves the required virtual address space for the memory but waits until the memory is actually used before initializing it. This approach alleviates the need to map the pages into memory right away. It also lets the system initialize pages as they are used, as opposed to all at once.
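For example, both fragments below produce a zero-filled one-megabyte buffer, but only the calloc version lets the system defer the zeroing:
    #include <stdlib.h>
    #include <string.h>

    void InitBuffers(void)
    {
        // Expensive: malloc followed by memset forces every page
        // to be mapped and zeroed immediately.
        char* eager = malloc(1024 * 1024);
        memset(eager, 0, 1024 * 1024);

        // Better: calloc reserves zero-filled memory and lets the
        // system initialize pages lazily, as they are first touched.
        char* lazy = calloc(1024 * 1024, sizeof(char));

        free(eager);
        free(lazy);
    }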
All memory blocks are contained within a malloc zone (also referred to as a malloc heap). All allocations made by the malloc function occur within the default malloc zone of the current process, which is created when malloc is first called. Although it is generally not recommended, you can create additional zones if measurements show there to be potential performance gains. For example, if the effect of releasing a large number of temporary (and isolated) objects is slowing down your application, you could allocate them in a zone instead and simply deallocate the zone.
Basic support for zones is defined in /usr/include/malloc/malloc.h. Use the malloc_create_zone function to create a custom malloc zone or the malloc_default_zone function to get the default zone for your application. To allocate memory in a particular zone, use the malloc_zone_malloc, malloc_zone_calloc, malloc_zone_valloc, or malloc_zone_realloc functions. To release the memory in a custom zone, call malloc_destroy_zone.
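Putting those calls together, a sketch of a throwaway zone for temporary objects (the Node type and sizes are illustrative):
    #include <malloc/malloc.h>

    typedef struct Node { struct Node* next; int value; } Node;

    void BuildAndDiscardList(void)
    {
        // Create a zone for short-lived allocations; the first
        // argument is a starting size hint, and flags should be 0.
        malloc_zone_t* zone = malloc_create_zone(4096, 0);
        Node* head = NULL;

        // Allocate many temporary objects from the zone.
        for (int i = 0; i < 1000; i++)
        {
            Node* n = malloc_zone_malloc(zone, sizeof(Node));
            n->value = i;
            n->next = head;
            head = n;
        }

        // ... use the list ...

        // Free every allocation in the zone with one call, making
        // sure nothing else still points into the zone.
        malloc_destroy_zone(zone);
    }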
If you are a Cocoa developer, you can also use the NSCreateZone function to create a custom malloc zone and the NSDefaultMallocZone function to get the default zone for your application. To create new objects in a custom zone, use the allocWithZone: class method, which is available to all subclasses of NSObject. If your class does not descend from NSObject, use the NSAllocateObject function to allocate the memory for your new instances. For more information, see the function descriptions in Foundation Framework Reference.
If you are creating objects (or allocating memory blocks) in a custom malloc zone, you can simply free the entire zone when you are done with it, instead of releasing the zone-allocated objects or memory blocks individually. When doing so, be sure your application data structures do not hold references to the memory in the custom zone. Attempting to access memory in a deallocated zone will cause a memory fault and crash your application.
If you have a frequently used function that allocates a large temporary buffer for some calculations, you might want to consider alternative ways to allocate that buffer. Instead of allocating a new block of memory each time the function is called, you can cache the buffer on the first invocation and reuse it on subsequent ones. If your function needs a variable amount of buffer space, you can always grow the buffer as needed. For multithreaded applications, you can attach the buffer pointer to your thread's context; for single-threaded applications, you can store the pointer in a global variable.
Caching buffers eliminates much of the overhead for functions that regularly allocate and free large blocks of memory. However, this technique is only appropriate for functions that are called frequently. Also, you should be careful not to cache too many large buffers. Caching buffers does add to the memory footprint of your application. You should be sure to gather metrics for your program with and without the caches to see which yields better performance.
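For the multithreaded case, one way to attach the cached buffer to each thread's context is POSIX thread-specific data, as in this sketch (the 64K buffer size is illustrative):
    #include <pthread.h>
    #include <stdlib.h>

    static pthread_key_t  sBufferKey;
    static pthread_once_t sBufferKeyOnce = PTHREAD_ONCE_INIT;

    static void CreateBufferKey(void)
    {
        // The destructor (free) releases each thread's buffer
        // automatically when the thread exits.
        pthread_key_create( &sBufferKey, free );
    }

    // Return this thread's scratch buffer, allocating it on the
    // first call and reusing it on every later call.
    void* GetScratchBuffer(void)
    {
        pthread_once( &sBufferKeyOnce, CreateBufferKey );

        void* buffer = pthread_getspecific( sBufferKey );
        if ( buffer == NULL )
        {
            buffer = malloc( 64 * 1024 );
            pthread_setspecific( sBufferKey, buffer );
        }
        return buffer;
    }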
Finally, keep in mind the importance of releasing (via the free routine) all memory you have allocated with malloc, calloc, or realloc. Neglecting to release memory causes memory leaks, which have a direct impact on performance. To help track down memory leaks, use the MallocDebug application or the leaks command-line tool. Both of these tools are described in "Examining Memory Allocation Patterns."
If you have existing code from Mac OS 9 that you are porting to Mac OS X, you can achieve some performance gains by simplifying your handle-related code. The benefit offered by handles in Mac OS 9 is no longer relevant in applications built for Mac OS X. In particular, there is no need to compact the memory blocks referenced by handles. As a result, your handles never move and there is no need to lock them when you want to access their contents.
If you have code that makes calls to HLock, HUnlock, HSetState, or HGetState, you can either conditionally compile that code out for Mac OS X or remove it entirely. The only exception to this rule is code that calls the SetHandleSize function, which can potentially move a handle if more space is required. If your code needs to access a handle that might be resized at some point, you should lock the handle first.
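One way to compile the locking out for Mac OS X while keeping it in a Mac OS 9 build is to test the TARGET_API_MAC_OSX macro from ConditionalMacros.h, as in this sketch:
    #include <ConditionalMacros.h>
    #include <MacMemory.h>

    void UseHandleContents(Handle h)
    {
    #if !TARGET_API_MAC_OSX
        // Only Mac OS 9 compacts handle blocks, so only that
        // build needs to lock the handle.
        HLock(h);
    #endif

        // ... safely dereference *h here ...

    #if !TARGET_API_MAC_OSX
        HUnlock(h);
    #endif
    }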
Copying Memory
There are two main approaches to copying memory in Mac OS X: direct and delayed. For most situations, the direct approach offers the best overall performance. However, there are times when a delayed-copy operation has its benefits. The goal of the following sections is to introduce you to the different approaches for copying memory and the situations in which to use them.
The direct copying of memory involves using a routine such as memcpy or memmove to copy bytes from one block to another. Both the source and destination blocks must be resident in memory at the time of the copy. These routines are especially suited for the following situations:
The size of the block you want to copy is small (under 16 kilobytes).
You intend to use either the source or destination right away.
The source or destination block is not page aligned.
The source and destination blocks overlap.
If you do not plan to use the source or destination data for some time, performing a direct copy can decrease performance significantly for large memory blocks. Copying the memory directly increases the size of your application's working set. Whenever you increase your application's working set, you increase the chances of paging to disk. If you have two direct copies of a large memory block in your working set, you might end up paging them both to disk. When you later access either the source or destination, you would then need to load that data back from disk, which is much more expensive than using vm_copy to perform a delayed copy operation.
Note: If the source and destination blocks overlap, you should prefer the use of memmove over memcpy. Both implementations handle overlapping blocks correctly in Mac OS X, but the implementation of memcpy is not guaranteed to do so.
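A minimal sketch of the overlapping case, shifting bytes within a single buffer, where memmove is the safe choice:
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buffer[] = "XXhello";

        // Source ("hello") and destination overlap within the same
        // buffer, so use memmove; memcpy is not guaranteed to
        // handle overlapping ranges.
        memmove(buffer, buffer + 2, strlen(buffer + 2) + 1);

        printf("%s\n", buffer);   // prints "hello"
        return 0;
    }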
If you intend to copy many pages worth of memory but don't intend to use either the source or destination pages immediately, you may want to use the vm_copy routine. Unlike memmove or memcpy, vm_copy does not touch any real memory. It modifies the virtual memory map to indicate that the destination address range is a copy-on-write version of the source address range.
The vm_copy routine is more efficient than memcpy only in very specific situations. Specifically, it is more efficient when your code does not access either the source or destination memory for a fairly long time after the copy operation. The reason vm_copy is effective for delayed usage lies in the way the kernel handles the copy-on-write case. In order to perform the copy operation, the kernel must remove all references to the source pages from the virtual memory system. The next time a process accesses data on one of those source pages, a soft fault occurs, and the kernel maps the page back into the process space as a copy-on-write page. The process of handling a single soft fault is almost as expensive as copying the data directly.
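A sketch of a delayed copy using vm_copy: the destination range is allocated first, and vm_copy then marks it copy-on-write against the source. The source address and size are assumed to be page aligned:
    #include <mach/mach.h>

    // Make a copy-on-write duplicate of a page-aligned buffer.
    // Returns NULL on failure. size must be a multiple of the
    // VM page size.
    void* DelayedCopy(void* source, size_t size)
    {
        vm_address_t   dest = 0;
        kern_return_t  err;

        // Reserve a destination range of the same size.
        err = vm_allocate(mach_task_self(), &dest, size,
                          VM_FLAGS_ANYWHERE);
        if (err != KERN_SUCCESS)
            return NULL;

        // No bytes are copied here; pages are shared until one
        // side writes to them.
        err = vm_copy(mach_task_self(),
                      (vm_address_t) source, size, dest);
        if (err != KERN_SUCCESS)
        {
            vm_deallocate(mach_task_self(), dest, size);
            return NULL;
        }

        return (void*) dest;
    }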
If you need to copy small blocks of nonoverlapping data, you should prefer memcpy over any other routine. For small blocks of memory, the GCC compiler can optimize out this routine and replace it with inline instructions to copy the data by value. The compiler may not optimize out other routines such as memmove or BlockMoveData.
When copying data into VRAM, use the BlockMoveDataUncached function instead of functions such as bcopy. The bcopy routine uses cache-manipulation instructions that may cause exception errors. The kernel must fix these errors in order to continue, which slows down performance tremendously.
© 2003, 2006 Apple Computer, Inc. All Rights Reserved. (Last updated: 2006-06-28)