We are moving our website to Drupal, but it isn’t easy!

As I work my way through the whole process, I would like some help. It would be great if I could have a small number of Drupal users – who understand Drupal’s inner workings well – on whom I can call with specific questions. I don’t want to communicate with many users, as this would become tedious and slow.

In return, I would like to offer the Drupal community detailed information, by way of example, on what I have done from start to finish. I will provide my PHP programs, well documented, in the hope that they can be modified as needed by others encountering similar challenges. This should probably become part of the Migrating to Drupal documentation. When I am finished, I would also like to add to the documentation of the Drupal database tables and fields. I hope that the volunteers who come forward will also be willing to critique my contribution.

The major problem is that we are currently using a third party’s proprietary CMS (which I won’t identify, as it isn’t relevant), for which conversion tools are not available. Because it is proprietary, we cannot ascertain the database structure, and in any case we don’t have the credentials to access the database. But we can spider the site and save the pages as straight HTML files. So the problem consists of grabbing the appropriate pieces from the spidered backup and placing them into the appropriate tables in Drupal.

The site is bilingual (English and French), so there is the issue of language control.

There are also unusual items that perhaps have to be dropped in the conversion to Drupal. For example, some simple HTML/CSS is included in some of the page headings (allowing for italics when referencing book titles, or superscripts, etc.). For now, I am throwing this formatting away, but I would prefer to be able to keep it.
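One way to keep that heading formatting is to strip every tag except a short whitelist of inline ones. This is only a sketch: the function name `clean_heading` and the list of kept tags are assumptions, not anything taken from the site in question. Note also that Drupal normally escapes node titles on output, so markup kept this way may have to live in a separate text field rather than in the title itself.

```php
<?php
// Hypothetical helper: keep only a few inline tags in a spidered heading.
function clean_heading(string $html): string
{
    // strip_tags() removes every tag not named in the second argument.
    // Attributes on the kept tags are NOT removed by strip_tags().
    $kept = strip_tags($html, '<em><i><sup><sub>');

    // Collapse runs of ordinary whitespace left behind by the removed tags.
    // No /u modifier here, so non-breaking spaces are left untouched.
    return trim(preg_replace('/[ \t\r\n]+/', ' ', $kept));
}

echo clean_heading('<h1 class="t">Review of <i>Dune</i>, 2<sup>nd</sup> edition</h1>'), "\n";
// prints: Review of <i>Dune</i>, 2<sup>nd</sup> edition
```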

Moving a site provides an opportunity to restructure it and clean it up, so the old content has to be mapped to the new structure.

Some of these issues I have successfully tackled. Some are problematic. For example, it is unclear how the options field in table menu_links, which is defined as a blob, is used or should be filled in. Leaving it at its default value of NULL often results in runtime errors. Not all of the Drupal tables are documented, and the documentation that is provided is minimal.

I have developed some programs to start me on the process, and am updating and adding to these as I go. They are designed to do everything from start to finish if run in sequence. This allows me to continually update the existing site, and rerun the process with the updated information, or to recover when I do something incorrectly.

I also use this as an opportunity to catch errors in the content. Many people have contributed content directly to the site, and they haven’t always followed the rules that we have established. For example, a colon on our French pages must always be preceded by a non-breaking space, but these are sometimes missing.
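The colon rule can be checked and repaired mechanically. The sketch below is one possible approach, not the poster's actual program; the function name `fix_french_colons` is hypothetical. It assumes UTF-8 text that has already been separated from its markup, since colons inside URLs other than the `://` form, times such as 10:30, or inline CSS would need extra rules.

```php
<?php
// Hypothetical helper: put a non-breaking space (U+00A0) before colons
// in French text when one is missing.
function fix_french_colons(string $text): string
{
    $nbsp = "\xC2\xA0"; // UTF-8 bytes of U+00A0

    // Group 1 is the last character before the colon that is not a space,
    // tab, newline or non-breaking space. Ordinary spaces or tabs between
    // it and the colon are dropped. (?!\/\/) skips the colon in "http://".
    $fixed = preg_replace(
        '/([^ \t\r\n\x{00A0}])[ \t]*:(?!\/\/)/u',
        '${1}' . $nbsp . ':',
        $text
    );

    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    return $fixed ?? $text;
}
```

Comparing the result with the input shows whether a page broke the rule, which fits the "display a message so I can check" style of clean-up described in this thread.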

Many thanks in advance. I look forward to the challenges!


nevets:

Though the code is specific to another source, it may be helpful to look at https://drupal.org/project/example_web_scraper. You might also want to look at https://drupal.org/project/feeds_crawler and https://drupal.org/project/feeds_spider.

Side note: you really want to use the Drupal APIs when adding content.

ebrandon:

I appreciate your comments, nevets.

Regarding spidering

We already spider several of our sites using Wget, which captures more sites than just the one I'm converting. We have several sites, on different platforms and hosted in different shops. This spidering works fine, and gives us all the live pages as HTML files. And it has helped me in the past. For example, I have been known to update French pages with English text, etc. The spidered backup lets me get things back when I prove how human I am.

I have no problem pulling the information I need out of these files. This includes the page title, page contents, page format (we use a few different templates), and language (English or French). This content I have placed into a database table, from which I can carry out the next steps.
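For readers facing the same task, the extraction step could look roughly like the sketch below, which uses PHP's bundled DOM extension. Everything about the page layout is an assumption made for illustration: that the main content sits in `<div id="content">`, that the language is declared in the `lang` attribute of `<html>`, and that the files are UTF-8. The function name `extract_page` is hypothetical.

```php
<?php
// Hypothetical helper: pull title, language and main content out of one
// spidered HTML file. The div id "content" is an assumed template detail.
function extract_page(string $html): array
{
    $doc = new DOMDocument();
    // Spidered pages are rarely valid HTML, so collect libxml warnings
    // instead of printing them.
    libxml_use_internal_errors(true);
    // Common workaround: the XML declaration tells libxml the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $title = trim($xpath->evaluate('string(//title)'));

    $root = $doc->documentElement;
    $lang = $root !== null ? $root->getAttribute('lang') : '';

    // Serialise the children of the content div, not the div itself.
    $body = '';
    $content = $xpath->query('//div[@id="content"]')->item(0);
    if ($content !== null) {
        foreach ($content->childNodes as $child) {
            $body .= $doc->saveHTML($child);
        }
    }

    return [
        'title'    => $title,
        'language' => $lang !== '' ? $lang : 'und', // 'und' = undefined
        'body'     => trim($body),
    ];
}
```

Each returned array maps naturally onto one row of the intermediate table described above.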

Next steps

I have put a fair amount of effort into looking for errors in the content. Sometimes I fix these directly in my programs, and sometimes I display a message, so that I can check if this really needs to be changed. I have already identified over 50 things that I need to look into in this manner, and frequently update the current, non-Drupal site to address these problems. The next night's spidering is then used to provide a new set of page contents to the complete process. So this is iterative: I fix things as I find them, and use the cleaned-up pages to rerun everything.

This is all done with programs I've written in PHP.

Loading the Drupal database

I then run a few programs that load this information into the Drupal tables. Contents go into tables field_data_body and field_revision_body. Page information goes into tables node and node_revisions. This works fine too, and I see the results when I call up the new site (currently only available internally where I work). Indeed, this works with the default theme (Garland), but of course not all formatting is correct. Unfortunately, our designer hasn't finished her work, so I haven't yet started on theme building for the site. But this will not affect what I have done so far.

The next step is to tackle the hierarchy, which I do by updating table menu_links. This is not yet 100% successful. I found that if I did not supply a value to the options field in this table, Drupal would give me a run-time error message. Looking at this field in other Drupal sites that we do have, I found it contains binary code. I changed my program to add this, instead of leaving this field defaulted to NULL, and I no longer get the message. But this is working blindly, and I was unable to locate any documentation to describe how this field is used.
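A likely explanation for the "binary code": in Drupal 7's schema, the options column of menu_links is declared as a serialized PHP array (the options later handed to url() and l()), so what looks like binary in a database browser is the output of PHP's serialize(). The sketch below shows how such a value can be produced and decoded. The helper name `menu_link_options` and the sample attribute are illustrative assumptions; inside Drupal, menu_link_save() does this serialization for you.

```php
<?php
// Hypothetical helper: build a value suitable for a serialized-array column.
function menu_link_options(array $options = array()): string
{
    return serialize($options);
}

// An empty options array, rather than NULL.
echo menu_link_options(), "\n";
// prints: a:0:{}

// A link carrying a title attribute (sample data).
$value = menu_link_options(array('attributes' => array('title' => 'Accueil')));
echo $value, "\n";
// prints: a:1:{s:10:"attributes";a:1:{s:5:"title";s:7:"Accueil";}}

// Decoding a value copied from an existing site works the same way in reverse.
$decoded = unserialize($value);
echo $decoded['attributes']['title'], "\n";
// prints: Accueil
```

A NULL column does not unserialize to an array, which would be consistent with the run-time errors described above.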

I also created friendly URLs in table url_alias. In this one case, I may have done too much work. The result is that I can switch from English to French pages OK, but no longer from French to English.

Re-inventing the wheel

I strongly believe in using what is available. But my experience is that it doesn't cover this task very well. I hope that I'm wrong, and that someone can point me to specifics. You refer to APIs. So far I have found nothing to help me with our site, but sure would like to have specifics. Overall, my experience is that while there is a huge quantity of both documentation and modules, there isn't anything that really applies to the task of building a site from a non-Drupal site, carrying over the pages (title, contents, navigation, etc.). I would really like to be proven wrong, but I would need specifics. This is the help I'm really looking for.

And as I said earlier, I will be very pleased to put my complete experiences into documentation in Drupal, for the benefit of others. It's the least I can do!

nevets:

For content it is better to use node_prepare() and node_save() instead of writing to the database directly.
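As a rough illustration of that advice, the sketch below builds a node object from one extracted page and hands it to the node API. It assumes Drupal 7, where the preparation helper is named node_object_prepare(), and it only calls the API when those functions exist, so the file can also be run outside Drupal to inspect the object. The function names `build_page_node` and `save_page_node`, the 'page' content type, the 'full_html' text format and the 'und' key for the body field are all assumptions to adapt to the actual site.

```php
<?php
// Hypothetical helper: turn one extracted page into a Drupal 7 style node object.
function build_page_node(string $title, string $body, string $langcode): stdClass
{
    $node = new stdClass();
    $node->type = 'page';          // assumed content type
    $node->language = $langcode;   // 'en' or 'fr'
    $node->title = $title;
    // Field values are keyed by field language, then by delta.
    // 'und' (LANGUAGE_NONE) is assumed here for a non-translatable body field.
    $node->body = array(
        'und' => array(
            0 => array('value' => $body, 'format' => 'full_html'),
        ),
    );
    return $node;
}

// Saves through the node API when running inside a bootstrapped Drupal 7;
// returns false when the API is not available.
function save_page_node(stdClass $node): bool
{
    if (!function_exists('node_object_prepare') || !function_exists('node_save')) {
        return false;
    }
    node_object_prepare($node);  // fills in defaults such as uid and status
    node_save($node);            // writes the node, revision and field tables
    return true;
}

$node = build_page_node('Accueil', '<p>Bonjour</p>', 'fr');
var_dump(save_page_node($node));
```

Going through node_save() means the node, revision and field tables are all written consistently and other modules get a chance to react, which direct inserts skip.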

naziaali:

Great article!