Migrate user data

📘 Important

The data migration script relies on the entity.bulkCreate operation to create user accounts. That's important, because entity.bulkCreate does not generate webhook events: if you use it to create 10,000 user accounts, you won't receive any webhook notifications. Because entity.bulkCreate doesn't trigger entityCreated events, there's nothing for Webhooks v3 to report. Among other things, this means that, whenever you do a data migration, you don't have to worry about disabling any webhook subscriptions that listen for the entityCreated event.


To run the dataload.py script (and migrate your data), you need to include several of the command-line arguments described below (note that some of these arguments are optional and some are required):
Named parameters

-h, --help
Displays the parameters that can be used with dataload.py. For example:

python3 dataload.py -h

There is no need to include other parameters when using -h. In fact, if you do include additional parameters those parameters will be ignored and the help information will be displayed. This, by the way, is the same help you see if you call the script without any arguments.

-u, --apid_uri
The URI to your Identity Cloud Capture domain. For example:

-u https://educationcenter.us-dev.janraincapture.com

You can find the URI to your Capture domain by looking at the Manage Application page in the Console.

-i, --client-id
Client ID for the API client used to do the data migration. For example:

-i 382bgrkj4w28984myp7298pzh35sj2q

-s, --client-secret
Client secret for the API client used to do the data migration. For example:

-s b2gfp7mgk9332annghwcf0po57xzqht5

-k, --config-key
Reserved for Akamai internal use.

-d, --default-client
Reserved for Akamai internal use.

-t, --type-name
Name of the entity type that user records should be written to. For example:

-t user

If not specified, the script defaults to the user entity type.

-b, --batch-size
Number of records to be included in each batch; the default value is 10. For example:

-b 20
The batch size determines the number of records sent with each call to the entity.bulkCreate operation.

-a, --start-at
Record number (i.e., line number within the CSV file) where the migration process should start; the default value is 1. For example:

-a 100

This parameter is typically used if a previous import failed at an identifiable point in the process (e.g., if the first 99 records were successfully imported before network connectivity was lost).

-w, --workers
Total number of worker threads; the default value is 4. Adding threads can speed up the data migration process. For example:

-w 6

-o, --timeout
Amount of time that can elapse, in seconds, before an API call times out; the default value is 10. For example:

-o 5

It’s recommended that you never set the API timeout to be greater than 10 seconds. As a general rule, it’s better to change the batch size than it is to change the timeout interval.

-r, --rate
Maximum number of API calls that can be made per second; the default value is 1. For example:

-r 2

If you receive a 510 (rate limit) error while running the script, use this parameter to reduce the maximum number of API calls that can be made per second.

-x, --dry-run
Runs through the full data migration process, but without copying records from the legacy data file to the user profile store. Note that you must include all the required parameters in order to successfully complete a dry run. For example:

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 -x test_users.csv

-m, --delta-migration
Performs the migration as a delta migration, which overwrites any matching records already in the user store.

-p, --primary-key
Used during a delta migration. Specifies an existing attribute in the target entity type that identifies duplicate records to be updated. The attribute must be unique in the schema (the default primary key is email).

For example, suppose you use email as your primary key and your CSV file includes a record that has the email address karim.nafir@mail.com. If dataload.py finds an existing user profile record with that email address, the existing record is replaced by the record found in the CSV file. (See the sample delta migration command that follows this parameter list.)

Positional parameter

Path to the datafile containing the user records being migrated to the user profile store. This parameter should be the last parameter in your command-line call. For example:

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 test_users.csv
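
To illustrate the delta migration parameters, a delta run keyed on email might look something like this (the flag combination is illustrative; the credentials and file name reuse the sample values shown above):

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 -m -p email test_users.csv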

When all is said and done, a call to dataload.py will look something like this:

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 test_users.csv

Two things to keep in mind here. First, command-line arguments have both a short name and a long name; these two partial commands are identical:

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com
python3 dataload.py --apid_uri https://educationcenter.us-dev.janraincapture.com

Second, there’s no datafile command line argument. Instead, the datafile is simply the last item in the command:

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 test_users.csv

If the datafile is anywhere except at the very end of the command, your script will run and you won’t get any error messages. However, no data will be copied to the user profile store.

Incidentally, you can run dataload.py as many times as you want; there’s nothing to prevent you from doing that. Is there any reason why you’d even want to run dataload.py on multiple occasions? We can think of at least two possibilities:

  • By using the “dry run” option, you can “practice” running your script against your actual datafile a million times (or more) without ever copying any data to the user profile store. That’s a good way to rehearse the data migration, and to identify and resolve problems before you do everything for real. You should perform several dry runs before trying to do the actual migration.

    On a related note, you can do as many dry runs as you want (and need), but sooner or later you’ll have to migrate some real data to the real user profile store. When that time comes, we don’t recommend that, on your first try, you attempt to migrate all 9 million of your user accounts. Instead, you might want to migrate 3 or 4 user accounts and make sure that all the fields can be copied over successfully (the sketch following this list shows one way to carve out a small test file). If so, then you can try 50 or 100 accounts, and do the same thing. When you’re fully confident that the process is working, you can copy over the entire datafile.

  • If you have multiple legacy systems, you might want to copy them one at a time. For example, suppose you have separate systems for users in North and South America, users in Europe, and users in Africa and Asia. Instead of combining the CSV files, you might choose to do separate migrations, one for each legacy system.
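
As noted above, one way to stage a gradual migration is to copy a handful of records from your datafile into a separate test file. Here’s a minimal Python sketch of that idea; it assumes a standard comma-separated datafile with a header row, and the file names and record count are placeholders:

# sample_csv.py: copy the header row plus the first N data rows of a
# CSV file into a smaller file that can be used for a trial migration.
# Assumes a comma-separated datafile with a header row; the file names
# and the record count are placeholders.
import csv

SOURCE = "test_users.csv"    # full legacy datafile
SAMPLE = "sample_users.csv"  # small trial file
N = 4                        # number of user records to copy

with open(SOURCE, newline="", encoding="utf-8") as src, \
     open(SAMPLE, "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))  # header row
    for i, row in enumerate(reader):
        if i >= N:
            break
        writer.writerow(row)

You can then point dataload.py (or a dry run) at sample_users.csv before committing to the full file.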


How long does a data migration take?

Akamai supports the import of 10,000 records per minute when using dataload.py, which equates to 600,000 records per hour. These upper limits can be helpful in planning, but your actual run time can vary depending on the complexity of the records being imported. For example, records with a lot of plural data take longer to process. If you find that your records-per-minute average is below 10,000, you can try to tune performance by adjusting the following arguments (an illustrative command follows this list):

  • -b BATCH_SIZE
  • -w WORKERS
  • -r RATE_LIMIT
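
For example, you might raise the batch size and worker count while keeping the rate limit low (these values are illustrative only; the right numbers depend on the complexity of your records and on your rate limits):

python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 -b 30 -w 8 -r 2 test_users.csv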

Be careful not to allow too many API calls (-r) per second: the Akamai Identity Cloud APIs limit both the number of calls that can be made in one minute and the number of concurrent calls at any given time. Note that these limits apply to all traffic to your Akamai Identity Cloud instance, not just API calls from dataload.py. Because of that, you might want to plan your migrations to coincide with periods of non-peak traffic.

A rough calculation that can be used is this:

BATCH_SIZE x RATE_LIMIT x 60 = Number of records per minute.

This calculation assumes that the API response time from entity.bulkCreate is less than WORKERS/RATE_LIMIT. A larger BATCH_SIZE generally means a higher API response time. More attributes per record and the inclusion of complex structures like plurals will also increase API response time.
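
As a quick sanity check, here’s a small Python sketch of that arithmetic; the values are placeholders, not recommendations:

# Back-of-the-envelope throughput estimate for a dataload.py run.
# The values below are placeholders, not recommendations.
batch_size = 30          # -b (records per entity.bulkCreate call)
rate_limit = 2           # -r (API calls per second)
total_records = 600_000  # number of records in the legacy datafile

records_per_minute = batch_size * rate_limit * 60
estimated_minutes = total_records / records_per_minute

print(f"~{records_per_minute:,} records per minute")
print(f"~{estimated_minutes:.0f} minutes for {total_records:,} records")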

The best strategy for improving your dataload performance is to start by migrating a small sample of test records that closely match the format of your actual records, and to keep track of how long the migration takes. (It’s recommended that you perform test migrations in a non-production environment.) After the first test, adjust the arguments noted above and repeat as necessary.
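
One simple way to keep track of how long a test run takes is the shell’s built-in time command (shown here with the sample credentials used throughout this article):

time python3 dataload.py -u https://educationcenter.us-dev.janraincapture.com -i 382bgrkj4w28984myp7298pzh35sj2q -s b2gfp7mgk9332annghwcf0po57xzqht5 test_users.csv

Divide the number of records in the datafile by the elapsed time to get your records-per-minute average.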