Efficiently Importing Data

This article describes how you can efficiently import data into a Core Data application and turn the data into managed objects to save to a persistent store. It discusses some of the fundamental Cocoa patterns you should follow, and patterns that are specific to Core Data.

Cocoa Fundamentals

As in many other situations, when you use Core Data to import a data file it is important to remember that the "normal rules" of Cocoa application development apply, particularly if you are using a managed memory environment (as opposed to garbage collection). If you import a data file that you have to parse in some way, you are likely to create a large number of autoreleased objects. These can take up a lot of memory and lead to paging. Just as you would with a non-Core Data application, you can use local autorelease pools to put a bound on how many additional objects reside in memory (for example, if you create a loop to iterate over data, you can use an inner autorelease pool that you release and re-create every few times through your main loop). You can also create objects using alloc and init and then release them when you no longer need them; this avoids putting them in an autorelease pool in the first place. For more about the interaction between Core Data and memory management, see “Reducing Memory Overhead.”
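As a minimal sketch of the inner-pool technique, assuming a managed memory (non-ARC) build: the function name CountFields, the comma-separated input, and the POOL_INTERVAL value are illustrative assumptions, not part of any framework.

```objc
#import <Foundation/Foundation.h>

// Hypothetical parsing loop: counts the comma-separated fields in `lines`
// while bounding the number of live autoreleased temporaries with a local pool.
static NSUInteger CountFields(NSArray *lines)
{
    const NSUInteger POOL_INTERVAL = 500;   // illustrative value; tune by profiling
    NSUInteger total = 0, sinceDrain = 0;
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    for (NSString *line in lines) {
        // componentsSeparatedByString: returns an autoreleased array
        NSArray *fields = [line componentsSeparatedByString:@","];
        total += [fields count];

        if (++sinceDrain == POOL_INTERVAL) {
            [pool drain];                              // frees the temporaries created so far
            pool = [[NSAutoreleasePool alloc] init];   // start a fresh pool for the next run
            sinceDrain = 0;
        }
    }
    [pool drain];                                      // release the last pool
    return total;
}
```

Draining every few hundred iterations rather than on every pass keeps the cost of creating pools small while still capping the number of temporaries alive at any one time.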

You should also avoid repeating work unnecessarily. One subtle case lies in creating a predicate containing a variable. If you create the predicate as shown in the following example, you are not only creating a predicate every time through your loop, you are parsing one.

// loop over employeeIDs
// anID = ... each employeeID in turn
// within body of loop
NSString *predicateString = [NSString stringWithFormat:
        @"employeeID == %@", anID];
NSPredicate *predicate = [NSPredicate predicateWithFormat:predicateString];

To create a predicate from a formatted string, the framework must parse the string and create instances of predicate and expression objects. If you are using the same form of a predicate many times over but changing the value of one of the constant value expressions on each use, it is more efficient to create a predicate once and then use variable substitution (see Creating Predicates). This technique is illustrated in the following example.

// before loop
NSString *predicateString = [NSString stringWithFormat:
        @"employeeID == $EMPLOYEE_ID"];
NSPredicate *predicate = [NSPredicate predicateWithFormat:predicateString];

// within body of loop
NSDictionary *variables = [NSDictionary dictionaryWithObject:anID
        forKey:@"EMPLOYEE_ID"];
NSPredicate *localPredicate = [predicate predicateWithSubstitutionVariables:variables];

Reducing Peak Memory Footprint

If you import a large amount of data into a Core Data application, you should make sure you keep your application’s peak memory footprint low by importing the data in batches and purging the Core Data stack between batches. The relevant issues and techniques are discussed in “Core Data Performance” (particularly “Reducing Memory Overhead”) and “Memory Management Using Core Data,” but they’re summarized here for convenience.

Importing in batches

First, you should typically create a separate managed object context for the import, and set its undo manager to nil. (Contexts are not particularly expensive to create, so if you cache your persistent store coordinator you can use different contexts for different working sets or distinct operations.)

NSManagedObjectContext *importContext = [[NSManagedObjectContext alloc] init];
NSPersistentStoreCoordinator *coordinator = /* retrieve the coordinator */ ;
[importContext setPersistentStoreCoordinator:coordinator];
[importContext setUndoManager:nil];

(If you have an existing Core Data stack, you can get the persistent store coordinator from another managed object context.) Setting the undo manager to nil means that:

You don’t waste effort recording undo actions for changes (such as insertions) that will not be undone;
The undo manager doesn’t keep strong references to changed objects, which would otherwise prevent them from being deallocated (see “Change and Undo Management”).

You should import data and create corresponding managed objects in batches (the optimum size of the batch will depend on how much data is associated with each record and how low you want to keep the memory footprint). At the beginning of each batch you create a new autorelease pool. At the end of each batch you need to save the managed object context (using save:) and then drain the pool. (Until you save, the context needs to retain all the pending changes you've made to the inserted objects.)

The process is illustrated in the following example, although note that you would typically include suitable error-checking.

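NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
NSUInteger count = 0, LOOP_LIMIT = 1000;
NSDictionary *newRecord;
NSManagedObject *newMO;

// assume a method 'nextRecord' that returns a dictionary representing the next
// set of data to be imported from the file
while ((newRecord = [self nextRecord])) {
    // create managed object(s) from newRecord
    count++;
    if (count == LOOP_LIMIT) {
        // end of a batch: save, purge the context, and start a new pool
        // (outError is assumed to be the enclosing method's NSError ** parameter)
        [importContext save:outError];
        [importContext reset];
        [pool drain];
        pool = [[NSAutoreleasePool alloc] init];
        count = 0;
    }
}

// save any objects left over from the last, partial batch
[importContext save:outError];
[pool drain];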

Dealing with retain cycles

There is an additional issue that complicates matters in a managed memory environment (it doesn’t affect applications that use garbage collection). Managed objects with relationships nearly always create unreclaimable retain cycles. If during the import you create relationships between objects, you need to break the retain cycles so that the objects can be deallocated when they’re no longer needed. To do this, you can either turn the objects into faults, or reset the whole context. For a complete discussion, see “Breaking Relationship Retain Cycles.”
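As a rough sketch of the first option (turning objects into faults), assuming it runs at the end of a batch: the helper name SaveAndRefault is hypothetical, and the save must come first because refreshing with mergeChanges:NO discards any unsaved changes.

```objc
#import <CoreData/CoreData.h>

// Hypothetical helper (not a Core Data API): saves the context, then turns
// every registered object back into a fault so that the retain cycles
// between related objects are broken.
static BOOL SaveAndRefault(NSManagedObjectContext *context, NSError **outError)
{
    // save first: refreshObject:mergeChanges:NO throws away pending changes
    if (![context save:outError]) {
        return NO;
    }
    for (NSManagedObject *mo in [context registeredObjects]) {
        [context refreshObject:mo mergeChanges:NO];
    }
    return YES;
}
```

Resetting the whole context with reset achieves the same end more bluntly; turning objects into faults one by one is useful when you want to keep references to some of them.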

Document-based example

The following example illustrates how you could implement a subclass of NSDocumentController to allow your application to open a legacy file format and import the data into a new Core Data store. It assumes three additional methods: openURLForReadingLegacyData:error: to open the legacy file, nextRecord that returns a dictionary containing the data from the next record in the file, and closeLegacyFile to close the legacy file. As a further simplification, it assumes that the legacy file contains data for only one entity.

NSString *CORE_DATA_DOCUMENT_TYPE = @"CoreDataStoreDocumentType";
NSString *ENTITY_NAME = @"MyEntity";
NSUInteger LOOP_LIMIT = 5000;

@implementation MyDocumentController

- (id)openDocumentWithContentsOfURL:(NSURL *)absoluteURL
      display:(BOOL)displayDocument error:(NSError **)outError
{
    NSString *filePath = [absoluteURL relativePath];
    NSString *fileExtension = [filePath pathExtension];
    NSString *type = [self typeFromFileExtension:fileExtension];

    // a Core Data document needs no import; open it in the usual way
    if ([type isEqualToString:CORE_DATA_DOCUMENT_TYPE]) {
        return [super openDocumentWithContentsOfURL:absoluteURL
                      display:displayDocument error:outError];
    }

    BOOL ok = [self openURLForReadingLegacyData:absoluteURL error:outError];
    if (!ok) {
        return nil;
    }

    // put the new store next to the legacy file, replacing any existing store
    NSString *extension = [[self fileExtensionsFromType:CORE_DATA_DOCUMENT_TYPE]
                                     objectAtIndex:0];
    NSString *storePath = [[filePath stringByDeletingPathExtension]
                                     stringByAppendingPathExtension:extension];
    NSFileManager *fm = [NSFileManager defaultManager];
    if ([fm fileExistsAtPath:storePath]) {
        ok = [fm removeItemAtPath:storePath error:outError];
        if (!ok) {
            [self closeLegacyFile];
            return nil;
        }
    }
    NSURL *storeURL = [NSURL fileURLWithPath:storePath];

    // set up a Core Data stack for the import
    NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"MyDocument"
                                                 ofType:@"mom"];
    NSURL *modelURL = [NSURL fileURLWithPath:modelPath];
    NSManagedObjectModel *model = [[[NSManagedObjectModel alloc]
            initWithContentsOfURL:modelURL] autorelease];
    NSPersistentStoreCoordinator *psc = [[[NSPersistentStoreCoordinator alloc]
            initWithManagedObjectModel:model] autorelease];
    NSPersistentStore *store = [psc addPersistentStoreWithType:NSSQLiteStoreType
            configuration:nil URL:storeURL options:nil error:outError];
    if (store == nil) {
        [self closeLegacyFile];
        return nil;
    }

    NSManagedObjectContext *importContext = [[NSManagedObjectContext alloc] init];
    [importContext setPersistentStoreCoordinator:psc];
    [importContext setUndoManager:nil];

    NSEntityDescription *entity = [NSEntityDescription entityForName:ENTITY_NAME
            inManagedObjectContext:importContext];

    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    NSUInteger count = 0;
    NSDictionary *newRecord;
    NSManagedObject *newMO;

    while ((newRecord = [self nextRecord])) {
        /*
         It is more efficient to cache the entity and use initWithEntity:
         insertIntoManagedObjectContext: than to use the typically-more-convenient
         NSEntityDescription method insertNewObjectForEntityForName:inManagedObjectContext:.

         It is also more efficient to release the new managed object than it is to add it
         to the autorelease pool.  It can safely be released here since, because
         it's a newly-inserted object, the managed object context retains it.
        */
        newMO = [[NSManagedObject alloc] initWithEntity:entity
                       insertIntoManagedObjectContext:importContext];
        [newMO setValuesForKeysWithDictionary:newRecord];
        [newMO release];

        count++;
        if (count == LOOP_LIMIT) {
            ok = [importContext save:outError];
            if (!ok) {
                [importContext release];
                [pool drain];
                [self closeLegacyFile];
                return nil;
            }
            /*
             reset is not actually needed in this example since we're not creating
             any relationships but this would be one place to put it if we were
                 [importContext reset];
             or:
                 for (NSManagedObject* mo in [importContext registeredObjects])
                     [importContext refreshObject:mo mergeChanges:NO];
            */
            [pool drain];
            pool = [[NSAutoreleasePool alloc] init];
            count = 0;
        }
    }

    // save the last, partial batch and clean up
    ok = [importContext save:outError];
    [self closeLegacyFile];
    [importContext release];
    [pool drain];
    if (!ok) {
        return nil;
    }

    // open the newly created Core Data store as a normal document
    return [super openDocumentWithContentsOfURL:storeURL
                  display:displayDocument error:outError];
}

@end

Implementing Find-or-Create Efficiently

A common technique when importing data is to follow a "find-or-create" pattern, where you set up some data from which to create a managed object, determine whether the managed object already exists, and create it if it does not.

There are many situations where you may need to find existing objects (objects already saved in a store) for a set of discrete input values. A simple solution is to create a loop, then for each value in turn execute a fetch to determine whether there is a matching persisted object and so on. This pattern does not scale well. If you profile your application with this pattern, you typically find the fetch to be one of the more expensive operations in the loop (compared to just iterating over a collection of items). Even worse, this pattern turns an O(n) problem into an O(n^2) problem.

It is much more efficient, when possible, to create all the managed objects in a single pass, and then fix up any relationships in a second pass. For example, if you import data that you know does not contain any duplicates (say, because your initial data set is empty), you can just create managed objects to represent your data and not do any searches at all. Or if you import "flat" data with no relationships, you can create managed objects for the entire set and weed out (delete) any duplicates before saving, using a single large IN predicate.

If you do need to follow a find-or-create pattern (say, because you're importing heterogeneous data where relationship information is mixed in with attribute information), you can optimize how you find existing objects by reducing to a minimum the number of fetches you execute. How to accomplish this depends on the amount of reference data you have to work with. If you are importing 100 potential new objects, and only have 2000 in your database, fetching all of the existing objects and caching them may not represent a significant penalty (especially if you have to perform the operation more than once). However, if you have 100,000 items in your database, the memory pressure of keeping those cached may be prohibitive.
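For the small-data case, one way to cache the existing objects is to index them by their identifying attribute after a single fetch. The sketch below is an illustration only: IndexObjectsByKey is a hypothetical helper, and it assumes the values for the key (such as an employeeID string) are suitable as dictionary keys.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not a Core Data API): builds a lookup table from each
// object's value for `key` to the object itself. Objects with a nil value
// are skipped; if two objects share a value, the later one wins.
static NSDictionary *IndexObjectsByKey(NSArray *objects, NSString *key)
{
    NSMutableDictionary *index =
            [NSMutableDictionary dictionaryWithCapacity:[objects count]];
    for (id object in objects) {
        id value = [object valueForKey:key];   // works for managed objects via KVC
        if (value != nil) {
            [index setObject:object forKey:value];
        }
    }
    return index;
}

// Possible use, where allEmployees is the result of one fetch of every Employee:
//     NSDictionary *existing = IndexObjectsByKey(allEmployees, @"employeeID");
//     if ([existing objectForKey:anID] == nil) { /* create a new Employee for anID */ }
```

Each lookup is then a dictionary access rather than a fetch, at the cost of keeping every cached object in memory for the duration of the import.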

You can use a combination of an IN predicate and sorting to reduce your use of Core Data to a single fetch request. Suppose, for example, you want to take a list of employee IDs (as strings) and create Employee records for all those not already in the database. Consider this code, where Employee is an entity with an employeeID attribute, and listOfIDsAsString is the list of IDs for which you want to add objects if they do not already exist in a store.

First, separate and sort the IDs (strings) of interest.

// get the IDs to parse in sorted order
NSArray *employeeIDs = [[listOfIDsAsString componentsSeparatedByString:@"\n"]
        sortedArrayUsingSelector: @selector(compare:)];

Next, create a predicate using IN with the array of ID strings, and a sort descriptor which ensures the results are returned with the same sorting as the array of ID strings. (The IN is equivalent to an SQL IN operation, where the left-hand side must appear in the collection specified by the right-hand side.)

// create the fetch request to get all Employees matching the IDs
NSFetchRequest *fetchRequest = [[[NSFetchRequest alloc] init] autorelease];
[fetchRequest setEntity:
        [NSEntityDescription entityForName:@"Employee" inManagedObjectContext:aMOC]];
[fetchRequest setPredicate: [NSPredicate predicateWithFormat: @"(employeeID IN %@)", employeeIDs]];

// make sure the results are sorted as well
[fetchRequest setSortDescriptors: [NSArray arrayWithObject:
        [[[NSSortDescriptor alloc] initWithKey: @"employeeID"
                ascending:YES] autorelease]]];

Finally, execute the fetch.

NSError *error = nil;
NSArray *employeesMatchingNames = [aMOC
        executeFetchRequest:fetchRequest error:&error];

You end up with two sorted arrays—one with the employee IDs passed into the fetch request, and one with the managed objects that matched them. To process them, you walk the sorted lists following these steps:

Get the next ID and Employee. If the ID doesn't match the Employee ID, create a new Employee for that ID.
Get the next Employee: if the IDs match, move to the next ID and Employee.

Regardless of how many IDs you pass in, you only execute a single fetch, and the rest is just walking the result set.

The listing below shows the complete code for the example in the previous section.

// get the IDs to parse in sorted order
NSArray *employeeIDs = [[listOfIDsAsString componentsSeparatedByString:@"\n"]
        sortedArrayUsingSelector: @selector(compare:)];

// create the fetch request to get all Employees matching the IDs
NSFetchRequest *fetchRequest = [[[NSFetchRequest alloc] init] autorelease];
[fetchRequest setEntity:
        [NSEntityDescription entityForName:@"Employee" inManagedObjectContext:aMOC]];
[fetchRequest setPredicate: [NSPredicate predicateWithFormat: @"(employeeID IN %@)", employeeIDs]];

// make sure the results are sorted as well
[fetchRequest setSortDescriptors: [NSArray arrayWithObject:
        [[[NSSortDescriptor alloc] initWithKey: @"employeeID"
                ascending:YES] autorelease]]];

// Execute the fetch
NSError *error = nil;
NSArray *employeesMatchingNames = [aMOC
        executeFetchRequest:fetchRequest error:&error];
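The walk over the two sorted arrays described earlier is not shown in the listing. A minimal sketch of it follows, under stated assumptions: MissingEmployeeIDs is a hypothetical helper, both arrays are sorted in the same ascending order, and every fetched object's employeeID appears in the ID list (as it does when the objects come from the IN fetch). The sketch also assumes the store returns results in the same order that compare: produces for the ID strings.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not a Core Data API): returns the IDs in sortedIDs that
// have no matching object in sortedEmployees. Objects are read through
// key-value coding, so managed objects with an employeeID attribute work.
static NSArray *MissingEmployeeIDs(NSArray *sortedIDs, NSArray *sortedEmployees)
{
    NSMutableArray *missing = [NSMutableArray array];
    NSUInteger employeeIndex = 0, employeeCount = [sortedEmployees count];
    NSString *previousID = nil;

    for (NSString *anID in sortedIDs) {
        // skip repeated IDs so that each one is considered only once
        if (previousID != nil && [anID isEqualToString:previousID]) {
            continue;
        }
        previousID = anID;

        // consume every fetched object carrying this ID (normally zero or one)
        BOOL found = NO;
        while (employeeIndex < employeeCount &&
               [[[sortedEmployees objectAtIndex:employeeIndex]
                       valueForKey:@"employeeID"] isEqual:anID]) {
            found = YES;
            employeeIndex++;
        }
        if (!found) {
            [missing addObject:anID];   // no existing object: one must be created
        }
    }
    return missing;
}

// Possible use with the arrays from the listing:
//     for (NSString *anID in MissingEmployeeIDs(employeeIDs, employeesMatchingNames)) {
//         NSManagedObject *employee = [NSEntityDescription
//                 insertNewObjectForEntityForName:@"Employee" inManagedObjectContext:aMOC];
//         [employee setValue:anID forKey:@"employeeID"];
//     }
```

Each array is traversed once, so the work after the single fetch is linear in the number of IDs plus the number of matching objects.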