Drupal Migrate using XML in 0 to 35

Using Drupal Migrate is a great way to move your content into Drupal. Unfortunately, the documentation for XML import can be obscure. This happens when the people who developed a module try to explain what they did to someone who did not do the work: things that seem obvious to them are not obvious to anyone else.

I have spent some time recently importing content using XML. I am in no way an expert speeding down the fast lane; I am more the type cruising around town at a comfortable 35 mph.

To use Drupal Migrate you need to define your own class. A class is PHP code, used in object-oriented programming, that defines your data and how you can manipulate it. Most of the actual migration work is done by the classes provided by the Migrate module; you simply have to define the details of your migration.

Constructor - The constructor configures the Migrate module's classes for your specific data. I was able to follow the MigrateSourceList method, which uses one XML source (file or feed) that contains the ID numbers of all the content you want to import, and a second (file or feed) that contains the content itself. The wine example that ships with Migrate does this, but understanding what it really wants is more difficult.

Let's start the file:

/**
 * @file
 * Vision Article migration.
 */

Below is my class file explained:

/**
 * Vision Article migration class.
 */
class VisionArticleMigration extends XMLMigration {
public function __construct() {
$this->description = t('XML feed of Ektron Articles.');

So far, pretty easy. You need to name your class, extend the proper migration class, and give it a description.

// There isn't a consistent way to automatically identify appropriate
// "fields" from an XML feed, so we pass an explicit list of source fields.
$fields = array(
'id' => t('ID'),
'lang_type' => t('Language'),
'type' => t('Type'),
'image' => t('Image'),
'authors' => t('Authors'),
'article_category' => t('Article Category'),
'article_series_title' => t('Article Series Title'),
'article_part_no' => t('Article Series Part Number'),
'article_title' => t('Article Title'),
'article_date' => t('Article Date'),
'article_display_date' => t('Article Display Date'),
'article_dropheader' => t('Article Dropheader'),
'article_body' => t('Article Body'),
'article_author_name' => t('Article Author Name'),
'article_author_url' => t('Article Author Email Address'),
'article_authors' => t('Article Additional Authors'),
'article_postscript' => t('Article Postscript'),
'article_link_text' => t('Article Link text'),
'article_link' => t('Article Link'),
'article_image' => t('Article Image'),
'article_image_folder' => t('Article Image Folder'),
'article_image_alt' => t('Article Image Alt'),
'article_image_title' => t('Article Image Title'),
'article_image_caption' => t('Article Image Caption'),
'article_image_credit' => t('Article Image Credit'),
'article_sidebar_element' => t('Article Side Bar Content'),
'article_sidebar_element_margin' => t('Article Margin between Sidebar Content'),
'article_archived_html_content' => t('Article HTML Content from old system'),
'article_video_id' => t('Article ID of Associated Video Article'),
'metadata_title' => t('Metadata Title'),
'metadata_description' => t('Metadata Description'),
'metadata_keywords' => t('Metadata Keywords'),
'metadata_google_sitemap_priority' => t('Metadata Google Sitemap Priority'),
'metadata_google_sitemap_change_frequency' => t('Metadata Google Sitemap Change Frequency'),
'metadata_collection_number' => t('Metadata Collection Number'),
'title' => t('Title'),
'teaser' => t('Teaser'),
'alias' => t('Alias from old system'),
'taxonomy' => t('Taxonomy'),
'created_date' => t('Date Created'),
);

So what does this mean?

Each key in this array is a source field name. The names have nothing to do with your XML file; you simply need one field for each thing you want to import. For example, article_image_alt is the alt text for the image. Later you will define the xpath that loads each of these fields. This will start to come together below; just remember that each unique piece of information needs its own field.

// The source ID here is the one retrieved from the XML listing URL, and
// used to identify the specific item's URL.
$this->map = new MigrateSQLMap($this->machineName,
  array(
    'ID' => array(
      'type' => 'int',
      'unsigned' => TRUE,
      'not null' => TRUE,
      'description' => 'Source ID',
    ),
  ),
  MigrateDestinationNode::getKeySchema()
);

This sets up the migration map table in the database. The map is keyed on the source side: the Source ID is the field in the input file that points to the data record. My source file looks like:


So we need a table with a field for the ID, which is an integer.

// Source list URL.
$list_url = 'http://www.vision.org/visionmedia/generateexportlist.aspx';
// Each ID retrieved from the list URL will be plugged into :id in the
// item URL to fetch the specific objects.
// @todo: Add langtype for importing translated content.
$item_url = 'http://www.vision.org/visionmedia/generatecontentXML.aspx?id=:id';

// We use the MigrateSourceList class for any source where we obtain the
// list of IDs to process separately from the data for each item. The
// listing and item are represented by separate classes, so for example we
// could replace the XML listing with a file directory listing, or the XML
// item with a JSON item.
$this->source = new MigrateSourceList(new MigrateListXML($list_url),
new MigrateItemXML($item_url), $fields);
$this->destination = new MigrateDestinationNode('vision_article');

Now we are setting up the magic. We set up a list URL that returns the IDs of all the content to import, then an item URL that uses each ID to fetch the details for that item. Then you tell Migrate to use MigrateListXML to find the items to import and MigrateItemXML to fetch them. Finally, MigrateDestinationNode tells Migrate which content type to create. This means we need a separate migration class for each content type to import. I have been creating each class in its own .inc file and adding it to the files section of the module's .info file.
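As a rough illustration of the list-then-item idea, here is a minimal sketch in plain PHP, with no Migrate classes involved. The list XML layout, the `<ids>` and `<id>` element names, and the sketch_* function names are all assumptions made up for this example; your real list feed may be shaped differently.

```php
<?php
// Hypothetical ID list, standing in for what the list URL might return.
$list_xml = '<ids><id>101</id><id>102</id><id>102</id></ids>';

// The item URL pattern from the constructor above.
$item_url = 'http://www.vision.org/visionmedia/generatecontentXML.aspx?id=:id';

// Collect the text of every child of the root element as an ID,
// dropping duplicates.
function sketch_ids_from_list($xml_string) {
  $xml = simplexml_load_string($xml_string);
  $ids = array();
  foreach ($xml as $element) {
    $ids[] = (string) $element;
  }
  return array_values(array_unique($ids));
}

// Plug one ID into the :id placeholder of the item URL.
function sketch_item_url($item_url, $id) {
  return str_replace(':id', $id, $item_url);
}

// Print the item URL that would be fetched for each ID.
foreach (sketch_ids_from_list($list_xml) as $id) {
  echo sketch_item_url($item_url, $id), "\n";
}
```

The point is only that the list source yields IDs and the item source turns each ID into one fetch; Migrate does the equivalent work for you once the two classes are wired together.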

// TIP: Note that for XML sources, in addition to the source field passed to
// addFieldMapping (the name under which it will be saved in the data row
// passed through the migration process) we specify the Xpath used to retrieve
// the value from the XML.
$this->addFieldMapping('created', 'created_date')
  ->xpath('/content/CreateDate');

Now we map the source field to the destination field. created is the field name in the content type (vision_article); created_date is from our fields section above. Remember, I said we needed a definition for each part of the content we want to import. The xpath then points to the data in the XML feed. So this says: take the content of /content/CreateDate in the XML file, load it into the source field created_date, then store that in the created field of a new vision_article content item. I spell it out this way because if you do like me and cut and paste and forget to change the source field name, that source field will end up holding the data from the last xpath mapped to it.
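To make the xpath step concrete, here is a small standalone sketch using SimpleXML directly. The sample item XML and the sketch_apply_xpath helper are assumptions for illustration only; this is not how Migrate is implemented internally, just the same kind of lookup.

```php
<?php
// Hypothetical item XML, standing in for one response from the item URL.
$item_xml = '<content><ID>101</ID><CreateDate>2014-03-01</CreateDate></content>';

// Return the text of the first node matching the xpath, or NULL if none.
function sketch_apply_xpath($xml_string, $xpath) {
  $xml = simplexml_load_string($xml_string);
  $result = $xml->xpath($xpath);
  if (empty($result)) {
    return NULL;
  }
  return (string) $result[0];
}

// The source field name is the array key; the xpath decides what goes in it.
$source_row = array(
  'created_date' => sketch_apply_xpath($item_xml, '/content/CreateDate'),
);
echo $source_row['created_date'], "\n";
```

If two mappings reuse the same source field name with different xpaths, only one value can live under that key, which is the cut-and-paste trap described above.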

$this->addFieldMapping('field_category', 'article_category')

You can set a default value in case the XML does not contain any data.

$this->addFieldMapping('field_series_title', 'article_series_title')
$this->addFieldMapping('field_part_number', 'article_part_no')
$this->addFieldMapping('field_h1_title', 'article_title')
->arguments(array('format' => 'filtered_html'))
$this->addFieldMapping('field_display_date', 'article_display_date')
$this->addFieldMapping('field_drophead', 'article_dropheader')
->arguments(array('format' => 'filtered_html'))

Here is another field argument. The default text format is plain text, so if your content contains HTML you need to set the correct format here.

$this->addFieldMapping('body', 'article_body')
->arguments(array('format' => 'filtered_html'))
$this->addFieldMapping('body:summary', 'teaser')
->arguments(array('format' => 'filtered_html'))

Note that you can set the teaser as a part of the body. One of the drush migrate commands makes it easy to discover the additional parts of your content fields: drush mfd (Migrate Field Destinations). This displays all the destination fields and their options.

$this->addFieldMapping('field_author', 'article_author_email')
$this->addFieldMapping('field_author:title', 'article_author_name')
$this->addFieldMapping('field_ext_reference_title', 'article_postscript')
->arguments(array('format' => 'filtered_html'))

See explanation below:

->defaultValue(MigrateFile::FILE_EXISTS_REUSE); //FILE_EXISTS_REUSE is in the MigrateFile class
$this->addFieldMapping('field_article_images', 'article_image')
$this->addFieldMapping('field_article_images:source_dir', 'article_image_folder')
$this->addFieldMapping('field_article_images:alt', 'article_image_alt')
$this->addFieldMapping('field_article_images:title', 'article_image_title')

This section gets tricky. You are importing an image or other file. The default migration class for a file is MigrateFileUri. You can migrate all your files ahead of time or, as I am doing, do it inline. The main components are the main field, which is the file name, and source_dir, which is the path to the image. Drupal 7 has a database table for the files it uses, with the URI of each file. MigrateFile normally copies the file to the public folder and creates an entry in the file_managed table to record its location. I had already copied all the images to a public location on S3 storage, so I did not want Migrate to create a new file but to use the existing one. Thus the file_replace setting is given the constant MigrateFile::FILE_EXISTS_REUSE. This tells Migrate to use the existing file and make an entry in the file_managed table for it.

Later, in the prepareRow method, I will show how we separate the file name from the path and add both to the XML.

$this->addFieldMapping('field_archive', 'article_archived_html_content')
$this->addFieldMapping('field_ektron_id', 'id')
$this->addFieldMapping('field_ektron_alias', 'alias')
$this->addFieldMapping('field_sidebar', 'article_sidebar_element')
->arguments(array('format' => 'filtered_html'))
->defaultValue(MigrateFile::FILE_EXISTS_REUSE); //FILE_EXISTS_REUSE is in the MigrateFile class
$this->addFieldMapping('field_slider_image', 'image')
$this->addFieldMapping('field_slider_image:source_dir', 'image_folder')
$this->addFieldMapping('field_slider_image:alt', 'image_alt')
$this->addFieldMapping('field_slider_image:title', 'image_title')
$this->addFieldMapping('title', 'title')
$this->addFieldMapping('title_field', 'title')
// Declare unmapped source fields.
$unmapped_sources = array(

If you are not using a source field, best practice is to declare it in the unmapped sources array.

// Declare unmapped destination fields.
$unmapped_destinations = array(

If you are not using a destination field, best practice is to declare it in the unmapped destinations array. Note that if you later use the field, you need to remove it from this array.

if (module_exists('path')) {
if (module_exists('pathauto')) {
if (module_exists('statistics')) {
$this->addUnmigratedDestinations(array('totalcount', 'daycount', 'timestamp'));

The rest of the constructor is from the example. It did not cause me any problems, so I did not worry about it.

* {@inheritdoc}

Now we can add our own magic. We can alter the data from the source item before it is saved into the content item.

public function prepareRow($row) {
if (parent::prepareRow($row) === FALSE) {
return FALSE;
$ctype = (string)$row->XML->Type;
//set variable for return code
$ret = FALSE;

You will see debugging calls scattered through the prepareRow function. These are Devel commands that print to the screen. They should be commented out, but they show the process I went through to debug my particular prepareRow. Also note this is a great use of the Migrate UI: these print statements only help you in the web interface; if you use Drush you will not see the diagnostic output.

if ($ctype == '12'){

This is specific to my migration. The following code applies only to a content type of 12; the other content types have a different data structure. If prepareRow returns FALSE, the row is skipped.

// Map the article_postscript source field to the new destination fields.
//if((string)$row->XML->root->article->Title == ''){
// $row->XML->root->article->Title = $row->XML->root->Title;
$postscript = $row->XML->html->root->article->Postscript->asXML();
$postscript = str_replace('','',$postscript);
$postscript = str_replace('','',$postscript);
$row->XML->html->root->article->Postscript = $postscript;

Again, this is something unique to my migration. The content structure is contained in XML, so SimpleXML treats the embedded HTML as XML. The asXML() function returns a string containing the XML of the node. I can then assign this string back to the node, which turns it into a plain string node holding straight HTML. I need to do this for every node that contains HTML. Most of the time you will be able to pass the HTML through as a string node and will not have to do this transform.
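The technique can be sketched on its own, outside Migrate. Everything here is an assumption for illustration: the sample XML, the Body element, and the sketch_inner_html helper, which strips the wrapping tags by element name. It would not cope with a wrapper element that carries attributes.

```php
<?php
// Hypothetical article XML whose Body holds HTML that SimpleXML parses as XML.
$xml = simplexml_load_string(
  '<article><Body><p>Hello <em>world</em></p></Body></article>'
);

// Serialize a node with asXML() and strip its own wrapping tags, leaving
// only the inner markup as a string.
function sketch_inner_html(SimpleXMLElement $node) {
  $name = $node->getName();
  $markup = $node->asXML();
  // An empty element serializes as a self-closing tag.
  $markup = str_replace('<' . $name . '/>', '', $markup);
  $markup = str_replace('<' . $name . '>', '', $markup);
  $markup = str_replace('</' . $name . '>', '', $markup);
  return $markup;
}

$html = sketch_inner_html($xml->Body);
echo $html, "\n";

// Assign the string back so the node holds text instead of child elements,
// as the prepareRow code above does for its HTML fields.
$xml->Body = $html;
```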

//converts HTML nodes to string so they will load.
$body = $row->XML->html->root->article->Body->asXML();
$body = str_replace('','',$body);
$body = str_replace('','',$body);
$row->XML->html->root->article->Body = $body;
$title = $row->XML->html->root->article->Title->asXML();
$title = str_replace('','',$title);
$title = str_replace('','',$title);
$row->XML->html->root->article->Title = $title;
$drophead = $row->XML->html->root->article->Dropheader->asXML();
$drophead = str_replace('','',$drophead);
$drophead = str_replace('','',$drophead);
//If Dropheader is empty
$drophead = str_replace('','',$drophead);
$row->XML->html->root->article->Dropheader = $drophead;
//Array to allow conversion of Category text to ID
$cat_tax = array(
'Science and Environment' => 1,
'History' => 2,
'Social Issues' => 3,
'Family and Relationships' => 4,
'Life and Health' => 5,
'Religion and Spirituality' => 6,
'Biography' => 7,
'Ethics and Morality'=> 8,
'Society and Culture' => 9,
'Current Events and Politics' => 10,
'Philosophy and Ideas' => 11,
'Personal Development' => 12,
'Reviews' => 13,
'From the Publisher' => 14,
'Interviews' => 17,
);
//Convert additional taxonomies to tags
//$tax_id_in = (string)$row->XML->Taxonomy;
//$tax_id_array = explode(',',$tax_id_in);
//$tax_in_array = array();
//foreach($tax_id_array as $tax){
// If(is_null($cat_tax[tax]))
// $tax_in_array[] = $cat_tax[$tax];
//$new_tax = implode(',',$tax_in_array);
//$row->XML->Taxomomy = $new_tax;
// Change category text to ID
$category = (string)$row->XML->html->root->article->Category;
//Specify unknown category (ID 18) if we do not recognize the category.
//This lets the migration run and allows us to fix it later.
$category = trim($category);
$tax_cat = isset($cat_tax[$category]) ? $cat_tax[$category] : 18;
$row->XML->html->root->article->Category = $tax_cat;

The category field in the source is a text field. In Drupal the category is an entity reference to a taxonomy term, which requires an ID rather than text. I set up the categories manually ahead of time, so I created an array that has the text as the key and the ID as the value. You can use this to quickly look up the ID for the text in the Category field and then replace the text in Category with that ID. This works; another way is to migrate the categories first and then let that migration do the translation for you. That is a feature built into Migrate, and I explain it later.
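Here is the lookup on its own as a small sketch. The sketch_category_id function name is made up for this example, and the two-entry map is just a cut-down copy of the array above; 18 is the "unknown" term ID used in my code.

```php
<?php
// Cut-down copy of the category map: term name => term ID.
$cat_tax = array(
  'History' => 2,
  'Biography' => 7,
);

// Look up a category name, tolerating stray whitespace, and fall back to
// a catch-all term ID when the name is not in the map.
function sketch_category_id(array $map, $text, $fallback) {
  $key = trim($text);
  return isset($map[$key]) ? $map[$key] : $fallback;
}

echo sketch_category_id($cat_tax, ' History ', 18), "\n";
echo sketch_category_id($cat_tax, 'Gardening', 18), "\n";
```

Using isset() rather than reading the array element directly avoids a PHP notice when the source contains a category you did not anticipate.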

//modify the image file node.
if((string)$row->XML->html->root->article->Image->File->asXML() != ''){
$src = (string)$row->XML->html->root->article->Image->File->img->attributes()->src;
$src_new = str_replace('/visionmedia/uploadedImages/','http://assets.vision.org/uploadedimages/',$src);
$row->XML->html->root->article->Image->File->img->attributes()->src = $src_new;
$file_name = basename($src_new);
$file_path = rtrim(str_replace($file_name,'', $src_new), '/');

There is a lot of stuff here. Remember that for MigrateFile you need to present the file name and the source directory. The Image/File node contains an img tag, so we need to get the src attribute and extract the file name and source directory from it. So why the if? Migrate will import a null node as null, but this is PHP code running on the row. If you try to get the src attribute of a null node, it will throw an error. So the if statement checks whether the File node is empty (it contains only an empty File element) and, if so, skips this transformation; Migrate will simply import a null or empty field.

The src is a path relative to the website, so the first thing we do is change it to the full URL of the S3 content storage. The path is basically the same, except that in uploadedImages the I is uppercase in the database. The old site ran on a Windows server, so case did not matter, but the S3 URL is case sensitive. We then use basename() to extract the file name, use that to strip the file name off the path to get the file path, and create new children in the XML row to store both. I did not point this out earlier, but these are the xpaths used in the field mappings above.
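The path handling can be tried in isolation. This sketch reuses the same string operations as the prepareRow code above; the sketch_split_image_src function name and the sample src value are assumptions for illustration.

```php
<?php
// Rewrite a site-relative image path to the S3 URL, then split it into the
// file name and the directory, the two pieces the file mapping needs.
function sketch_split_image_src($src) {
  $url = str_replace(
    '/visionmedia/uploadedImages/',
    'http://assets.vision.org/uploadedimages/',
    $src
  );
  $file_name = basename($url);
  // Remove the file name from the URL and drop the trailing slash.
  $source_dir = rtrim(str_replace($file_name, '', $url), '/');
  return array('file' => $file_name, 'source_dir' => $source_dir);
}

// Hypothetical src attribute value from an img tag in the feed.
$parts = sketch_split_image_src('/visionmedia/uploadedImages/articles/photo.jpg');
echo $parts['file'], "\n";
echo $parts['source_dir'], "\n";
```

One caveat with removing the file name by str_replace: if the file name also appears as a directory name in the path, that occurrence is removed too. PHP's dirname() would be a safer way to get the directory.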

$email = (string)$row->XML->html->root->article->AuthorURL;
if (!empty($email)){
$email = 'mailto:'.$email;
$row->XML->html->root->article->AuthorURL = $email;

The author URL is the email address of the article's author. We turn it into a mailto link so that it generates a link for sending the author an email.

$archive_html = (string)$row->XML->html->asXML();
$sidebar_element = (string)$row->XML->html->root->article->SidebarElement->SidebarElementInformation->asXML();
$row->XML->html->root->article->SidebarElement->SidebarElementInformation = $sidebar_element;
$slider_src = (string)$row->XML->Image;
$slider_src_new = str_replace('/visionmedia/uploadedImages/','http://assets.vision.org/uploadedimages/',$slider_src);
$row->XML->Image = $slider_src_new;
$slider_file_name = basename($slider_src_new);
$slider_file_path = rtrim(str_replace($slider_file_name,'', $slider_src_new), '/');

The rest is repetition of the above techniques. Note that we return TRUE if we want to process the row and FALSE if we do not.

//Need to add processing for other Article Content types especially 0 (HTML content)
return $ret;

This is the class I use for one of the imports. I told you that I would show the use of another migration in the field mappings. Below is a snippet of code from the issues migration. Each issue contains entity references to the vision_article nodes that were imported above.

$this->addFieldMapping('field_articles', 'article_id')
  ->sourceMigration('VisionArticle');

So this says to use the VisionArticle migration (I will show you where to find this name next). Migrate knows to look up the source ID, relate it to the destination ID, and store that in the field_articles field.

Migrate has been around for a while. Initially they said that classes would be registered automatically and that you could register them manually if needed. Then they changed this to say that classes will not be registered automatically and that you should register them yourself. So your migration module should include the following, which registers your classes. Note that the name of each array element is the migration name used above.

function vision_migrate_migrate_api() {
  $api = array(
    'api' => 2,
    // Give the group a human readable title.
    'groups' => array(
      'vision' => array(
        'title' => t('Vision'),
      ),
    ),
    'migrations' => array(
      'VisionArticle' => array('class_name' => 'VisionArticleMigration'),
      'VisionIssue' => array('class_name' => 'VisionIssueMigration'),
      'VisionVideoArticle' => array('class_name' => 'VisionVideoArticleMigration'),
      'VisionFrontpage' => array('class_name' => 'VisionFrontpageMigration'),
    ),
  );
  return $api;
}

I hope this makes things a little easier to understand. You will need some basic module-building skills, such as knowing which files a module needs, but this should help you through the more obscure parts of creating your migration class.