Data transformations and data validations
Dataload.py (with an assist from transformations.py) can convert data from one format to another; for example, you can convert a Yes to a True, or convert the date 7/12/1967 to 1967-07-12. As we've seen, it's generally pretty easy to carry out data transformations as part of the data migration process. What isn't so easy (at least not without a bit of effort on your part) is carrying out data validations.
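As a rough illustration of the two conversions just mentioned, here is a minimal sketch. The function names to_boolean and to_iso_date are hypothetical and are not taken from transformations.py, and the sketch assumes that the legacy dates are written month/day/year.

```python
from datetime import datetime

# Hypothetical helpers for illustration only; these are not the functions
# that ship in transformations.py.
TRUE_VALUES = {"yes", "y", "true", "1"}
FALSE_VALUES = {"no", "n", "false", "0"}


def to_boolean(value):
    """Map a legacy Yes/No style string to a Python bool."""
    normalized = value.strip().lower()
    if normalized in TRUE_VALUES:
        return True
    if normalized in FALSE_VALUES:
        return False
    raise ValueError(f"Cannot interpret {value!r} as a boolean")


def to_iso_date(value, source_format="%m/%d/%Y"):
    """Convert a date such as 7/12/1967 to 1967-07-12.

    Assumes month/day/year input; pass a different source_format otherwise.
    """
    return datetime.strptime(value.strip(), source_format).date().isoformat()
```

Raising an error on an unrecognized value (instead of quietly guessing) is a design choice here: it makes bad source data visible before it reaches the profile store.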
Let's explain what that means. If you've registered with an Akamai-powered website or if you've created a user profile in the Console, then you know what data validation is all about. For example, suppose you try to register a new account on an Akamai Identity Cloud website, and you forget to enter a display name. Instead of creating the account, the registration process stops and tells you that something is missing:
That's an example of data validation: before even trying to create a new account the registration form has checked to see that everything is in order (e.g., that all the required fields have been filled out). If everything isn't in order, the process stops dead in its tracks.
This is also an example of what the data migration process doesn't do: dataload.py won't verify that you've entered values for all the required fields, and it won't verify that the data that you did enter is correctly formatted (for example, the script doesn't ensure that an email address looks similar to karim.nafir@mail.com). We just saw that you can't register on an Akamai-powered website without including a display name; on top of that, the display name you do enter must be unique. If it isn't, you'll be prevented from creating an account:
However, after doing a trial data migration, we ended up with several users who don't have display names:
We also have three recently-migrated users who have the same display name:
How could those things happen? Those things can happen because, as we noted earlier, the data migration process doesn't validate the data: to a very large extent it simply copies over whatever data you give it. If you don't list a display name for a user then that user won't have a display name. And if you list the same display name for 25 users, well ...
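If you'd like to spot these problems before running a migration, a quick audit of the source file is one option. The following is a minimal sketch, not part of dataload.py: the function names are hypothetical, and it assumes that your CSV file has a header row with a column named displayName.

```python
import csv
from collections import Counter


def audit_display_names(rows, column="displayName"):
    """Return (line numbers with no display name, {display name: count} for repeats).

    rows is any iterable of dicts, such as a csv.DictReader. The column name
    is an assumption about how the source file is laid out.
    """
    missing = []
    counts = Counter()
    # Line 1 of the file is the header, so the first record is on line 2.
    for line_number, row in enumerate(rows, start=2):
        name = (row.get(column) or "").strip()
        if not name:
            missing.append(line_number)
        else:
            counts[name] += 1
    duplicates = {name: n for name, n in counts.items() if n > 1}
    return missing, duplicates


def audit_file(path, column="displayName"):
    """Run the audit against a CSV file on disk."""
    with open(path, newline="", encoding="utf-8") as handle:
        return audit_display_names(csv.DictReader(handle), column)
```

The audit only reports what it finds; deciding whether to fix, skip, or import those records anyway is still up to you.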
Here's something else you need to know. It's possible to import a million users who all have the same display name (we don't recommend it, but you can do it). You can also import users who don't have an email address:
However, if you try to import a user who has an email address that's already in the system, that user account will not be copied over to the user profile store. Instead, the record will be skipped, and you'll see an entry like this in the fail.csv log:
batch,line,error
1,2,Attempted to update a duplicate value
So what's the deal here? Didn't dataload.py do a data validation in this particular case?
Believe it or not, no, it didn't: dataload.py never does data validations. Instead, the underlying user profile schema performed a data validation (and, as a result, would not allow the record to be written to the profile store). If you look at the schema (or at least at the schema we used for our data migration test), you'll see that the email attribute is not required, but that it is globally unique:
In other words, and as far as the schema is concerned, you don't have to have an email address, but, if you have one, that address has to be unique. As for displayName, the schema doesn't flag that attribute as either required or unique:
Because of that, the records that we import don't have to have a display name and, if they do have a display name, that display name doesn't have to be unique. Something to be aware of.
If you need data validations, you'll either have to write custom code that can perform those validations or work with your Akamai representative to see if those validations can be placed on your schema.
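What might that custom code look like? Here is one possible starting point, offered as a sketch and not as a definitive implementation. Everything in it is an assumption made for illustration: the validate_records function, the choice of required and unique fields, and the deliberately loose email pattern. None of it comes from dataload.py or from your schema.

```python
import re

# Deliberately loose pattern (something@something.tld). This is an assumption
# for illustration, not the rule that any registration form or schema applies.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_records(records, required=("email", "displayName"), unique=("email",)):
    """Split records into (valid, rejected).

    Each rejected item is a (record, reasons) tuple. Uniqueness is checked
    only against records that were already accepted from the same input,
    not against users that already exist in the profile store.
    """
    seen = {field: set() for field in unique}
    valid, rejected = [], []
    for record in records:
        reasons = []
        for field in required:
            if not (record.get(field) or "").strip():
                reasons.append(f"missing {field}")
        email = (record.get("email") or "").strip()
        if email and not EMAIL_PATTERN.match(email):
            reasons.append("malformed email")
        for field in unique:
            value = (record.get(field) or "").strip().lower()
            if value and value in seen[field]:
                reasons.append(f"duplicate {field}")
        if reasons:
            rejected.append((record, reasons))
        else:
            valid.append(record)
            for field in unique:
                seen[field].add((record.get(field) or "").strip().lower())
    return valid, rejected
```

You could run something like this over your source records before the migration, import only the valid list, and review the rejected list by hand. Keep in mind that a check like this can't see what is already in the profile store, so the schema might still reject a record that passes here.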