Screenshot of Property Place, a property search application on Facebook
Why Drupal was chosen: 

Although the implementation we're about to describe isn't unique in the Drupal-verse, it still shows how some interesting challenges were overcome, which may help future Drupalers faced with large volumes of frequently changing, heavy data. It may also help convince business decision makers to use Drupal in ways they hadn't previously considered.

Describe the project (goals, requirements and outcome): 



Background and Project Goals


Property Place is a Facebook application constructed using Drupal 7 that allows users to search, share and sell property via their Facebook account. It offers consumers and selling agents a more intuitive and efficient marketing alternative to existing channels. The application aims to be the number one Facebook property application, and although the app is focused on the UK market at present, it has global ambition and capability.

At a high level, the site takes a data feed from a third party containing every property available to rent or buy in the UK, cleanses it, and imports it into Drupal.


What separates this from a run-of-the-mill Drupal install is the volume of data that we had to contend with.

Data challenges

  • Approximately 400,000 active properties.
  • Around 3 million supplemental items of information, covering things like room sizes and other property features.
  • Data churn of approximately 10,000 new properties a day, plus up to 30,000 updates and 10,000 deletions.
  • 2GB of property data held in CSV format.
  • 350GB of property images spread across 2,000,000 files.
  • New data made available twice daily via FTP.

Data cleansing

Due to the volume and quality of the data being processed, it was necessary to write a custom import process that sat outside of Drupal and dealt with the initial data load. This process worked by walking sequentially through the CSV file, calculating a hash of each row and comparing it to the hash value we already held in a non-Drupal database for that property ID. Doing this meant that we only updated or inserted rows that had changed, which was important from a performance perspective. Importing the CSV file into a temporary database also gave us the opportunity to cleanse some of the data held within the CSV where appropriate, and to add new fields that would be useful for the property taxonomy later, such as price band and property type.
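
The row-hashing idea can be sketched as follows - a minimal Python illustration rather than the production code (which was a custom process writing to a MySQL holding database); the function name and row layout here are invented:

```python
import csv
import hashlib

def diff_feed(csv_path, stored_hashes):
    """Return the feed rows that need inserting or updating.

    `stored_hashes` maps property ID -> hash recorded on the previous run;
    rows whose hash is unchanged are skipped entirely, which is where the
    big performance win comes from.
    """
    to_insert, to_update = [], []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            property_id = row[0]  # assume the first column is the property ID
            row_hash = hashlib.md5(",".join(row).encode("utf-8")).hexdigest()
            if property_id not in stored_hashes:
                to_insert.append(row)      # brand-new property
            elif stored_hashes[property_id] != row_hash:
                to_update.append(row)      # property has changed since last run
    return to_insert, to_update
```

On the next run, the hashes of the rows just processed replace the stored values, so each import only pays for the churn rather than for all 400,000 rows.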

The other challenge at this stage in the process was that when a property is removed from sale or rent, it is simply stripped from the data feed. Without some form of additional cleanup process, a large number of orphaned properties would accumulate on the site. To get around this we needed another process to identify the properties that no longer existed within the feed and remove them from the site.
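
Identifying the orphans is essentially a set difference between the property IDs we already hold and the IDs present in the latest feed - a minimal sketch (names invented):

```python
def find_orphans(ids_in_database, ids_in_feed):
    """Properties we hold that are absent from the latest feed have been
    withdrawn from sale or rent, so they should be removed from the site."""
    return set(ids_in_database) - set(ids_in_feed)
```

The resulting ID set then drives the node deletions on the Drupal side.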

This part of the data cleansing process takes around 30 minutes. Without the hashing approach it would have taken closer to 5 hours, so hashing cut the processing time by roughly 90%.

Data Import

With the data from the CSV file now held within a local non-Drupal database, we needed a way to import it into Drupal.


Having had some success with the Feeds module on previous projects, where we were handling something in the region of 20,000 records with a churn of about 200-500 a day, we initially decided to give it a try on the Property Place project. However, it didn't take us long to realise that we were pushing Feeds somewhere it wasn't designed to go.

One of our primary concerns was that using Feeds required us to create an intermediary data format - by default, Feeds lets you pass data to it in either CSV or XML format (although there now seems to be a new module called feeds_sql that allows you to take data straight from a database). In our early tests Feeds struggled to handle the volume of data we were trying to throw at it. It also became especially frustrating when a data import failed, as it was often then necessary to delete all the previously imported nodes and start afresh, instead of being able to roll back the changes.

Feeds was also quite inflexible when it came to updating content, which was a huge part of this process.


After realising that Feeds was not the right tool for the job, we decided to try another popular data migration / import module: Migrate. Migrate was developed by a company called Cyrve and has been used extensively on several high-profile migrations to Drupal, such as The Economist and The Examiner.

Migrate differs from other import modules in that you have to write your own migration classes, as opposed to relying on a web UI to configure an import. Although this requires a bit more effort upfront to understand the module and do something useful with it, it provides much greater flexibility, because you are able to run custom code at the point of import. For example, we were able to find images on the local disk and add them to the property if they existed, and add taxonomy terms based on certain combinations of fields.
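
As an example of the kind of derived field this made possible, here is a rough Python sketch of mapping a raw price to a price-band taxonomy term (the band boundaries are invented for illustration; the real logic lived in PHP inside our Migrate classes):

```python
def price_band(price):
    """Map a raw asking price (GBP) to a coarse price-band term.
    The band boundaries below are purely illustrative."""
    bands = [
        (100_000, "Up to 100k"),
        (250_000, "100k - 250k"),
        (500_000, "250k - 500k"),
    ]
    for ceiling, label in bands:
        if price <= ceiling:
            return label
    return "Over 500k"
```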

One of the particularly useful features of the Migrate module is what it calls a highwater field, which essentially lets you import and roll back content based on date. To leverage this feature we added a last-modified timestamp column to the records in our non-Drupal DB. Migrate then compares that column against its internal record of the last-modified date of the most recently migrated piece of content; anything newer gets imported or updated. To achieve the same effect with Feeds we would have had to dynamically create CSV files of new or updated records, in batches small enough for Feeds to handle - it was all just getting a bit too messy!
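
The highwater behaviour boils down to a timestamp comparison - sketched here in Python (field names invented; Migrate handles this for you once the highwater field is declared):

```python
def rows_to_migrate(rows, highwater):
    """Select only rows modified since the stored highwater mark, and
    compute the new mark to store once they have been imported."""
    fresh = [r for r in rows if r["last_modified"] > highwater]
    new_highwater = max((r["last_modified"] for r in fresh), default=highwater)
    return fresh, new_highwater
```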

Another feature the Migrate module offered was the ability to easily collect data from multiple columns and add it to a single multi-valued field in Drupal.

What really sets Migrate apart from the other modules we looked at (from a developer's perspective) is that it allows you to roll back an import, change your code and then re-import. This is a real time-saver during development, as you'll often need to alter items on their way into Drupal, test, modify the code and test again!

Image handling

Downloading the images

Importing 350GB of images into Drupal using conventional means is a bit of a non-starter, particularly when they are only made available individually, there are some 2 million of them, and the only means of retrieving them is via HTTP.

One of the main issues we found when attempting to programmatically download that volume of images is that if you do it sequentially using cURL or file_get_contents, you are subject to timeout errors, false 404s and 500 errors - all of which slow the download. We calculated that doing it this way would have taken some 8 weeks, as we were pulling images down at a painfully slow 5 - 10GB a day despite having a lightning-quick Internet connection.

What we needed was a multi-threaded solution. After considering writing some custom PHP utilising cURL's multi-handle support, and a variety of other options, we realised there was a far simpler approach. Step forward the trusty Unix wget command. By calling it with only a few parameters, we were able to max out our Internet connection and download every image into an organised directory. In the snippet below you can see how we background the wget commands so that they run in parallel - we show three here, but in practice we use something nearer ten.

sudo wget -x -N -q -i images.txt &
sudo wget -x -N -q -i images.txt &
sudo wget -x -N -q -i images.txt

What each of these invocations effectively does is:

  • Run through each of the image URLs in the images.txt file (-i).
  • Download each image into a directory structure derived from the URL (-x).
  • Only download a file if the remote copy is newer than the one currently held in the local directory structure (wget's timestamping behaviour, -N).
  • Do all of this quietly (-q).

This process also works for the ongoing image downloads.

High-level data import process summary

After a fairly long process of trial and error we managed to get the daily import time down to around 90 minutes.

  1. Trigger the import process with a cron job.
  2. Get the property CSV data from FTP.
  3. Verify the data with a checksum.
  4. Import the data into the holding DB, performing some data cleansing and supplementing it with new data.
  5. Generate the list of properties to be deleted (identified by the fact that they are no longer present in the feed).
  6. Generate the list of images associated with each property.
  7. Start the wget image download process.
  8. Begin the import using the Drupal Migrate module.
  9. Add / update properties.
Content Types

The site is split into two main content types: one handles the properties and all information relating to them; the other stores information relating to clients (i.e. estate agents).

At the start of the project we had some debate over the ideal content type configuration, mostly around the properties content type. We didn't know whether it was better to have a single properties content type with references to other nodes holding detail about the rooms, along with any specific photos or notes. That approach would have meant the number of nodes held within Drupal running into the millions, which made us slightly uneasy. In the end we decided to have just one content type for properties, set up in such a way that it could store everything required - again made possible by the Migrate module.


Search

With the default Drupal search functionality not really designed to handle the volume of data the site contained, it was apparent that we needed to interface with an enterprise-grade search tool. We needed a search tool that offered us the following key features:

  • Faceted search. This would allow users to filter search results by taxonomy terms that we had set against properties, such as number of bedrooms, price band, house type etc.
  • Advanced sorting. We needed to allow users to perform a search, filter by several items, and then sort the results on values such as price, date added and the like.
  • Fast. We needed to be able to deal with several free-text searches a second.
  • Geo-spatial Search. With all of the properties held within the site having their locations geo-coded, we needed to make sure that the search functionality allowed us to do "distance from" type searches.
  • Scalable. If the volume of traffic meets the forecasts, then we needed to be confident that we could scale the search functionality to meet the performance demands.

We already had Apache Solr in mind at the start of the project, and after some research decided that it was still the right solution for us - primarily because it comfortably met the above requirements, had existing Drupal module support (Search API, Search API sorts, Search API ranges, Solr search) and had a growing following within the Drupal community.

The auto-suggest functionality was implemented in a slightly custom way. To power it, we take data from several different fields during the import process and load it into a separate "location" table, which the auto-complete then looks up against. With this table indexed, suggestions come back quickly.
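
The lookup itself is just an anchored prefix match against the indexed column - a self-contained sketch using SQLite (the site used a MySQL table, and the place names here are invented):

```python
import sqlite3

# Stand-in for the "location" table populated during the import process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (name TEXT)")
conn.executemany("INSERT INTO location VALUES (?)",
                 [("Leeds",), ("London",), ("Londonderry",), ("Luton",)])
conn.execute("CREATE INDEX idx_location_name ON location (name)")

def suggest(prefix, limit=5):
    """Return auto-suggest candidates for the text typed so far.
    A right-anchored LIKE ('prefix%') keeps the query index-friendly;
    a leading wildcard ('%prefix%') would force a full scan."""
    cur = conn.execute(
        "SELECT name FROM location WHERE name LIKE ? ORDER BY name LIMIT ?",
        (prefix + "%", limit))
    return [row[0] for row in cur]
```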


Modules

With performance being a very important part of the project, we tried to use the leanest suite of modules we possibly could. We have mentioned the key modules throughout this case study, so we won't list them all again; beyond those, we used the usual ones that most projects call upon at some stage: Views, Rules, Webform, etc.

Facebook Integration

When it comes to Facebook module options for Drupal, your choices are fairly limited if you want a shortcut to a Facebook canvas-based application. Having built a fair number of Facebook applications in the past, we knew that the level of control we required could only be achieved by creating a custom module.

The key bits of functionality we required from the Facebook integration were auto-login, obtaining permissions from the user, social plug-ins and custom posts to the wall. To meet these requirements it was apparent that we needed three different elements of the Facebook platform: the PHP SDK, the JS API and Social Plugins.

PHP SDK 3.11

Using Drupal to power a Facebook application is nothing revolutionary - all you have to do is include the Facebook PHP SDK, perform some authentication and away you go. That wasn't enough for this implementation, however, as it was necessary to dynamically create a Drupal user account from each user's Facebook details so that features such as "My properties" and property upload could work.

When developing for Facebook, however, there are a few gotchas to look out for:

  • You should tweak settings.php so that cookies expire at the end of the session, keeping your application session and Facebook session in sync.
  • Be wary of any POST backs from Facebook, as they can cause 500 errors under the right circumstances.
  • Make sure you set the correct headers for your Facebook application (we needed to include header('P3P: CP="CAO PSA OUR"') to prevent a continuous redirect loop in IE).


JS API

Given that the Facebook login was handled by the PHP SDK, all the Facebook JavaScript library was needed for was the post-to-wall functionality, primarily via the FB.ui function. Below is a snippet of the Facebook JS code we use within the app to do just this (comments left in so you can see why developers often become frustrated with the Facebook API!):

// Create a Send dialogue.
// NOTE: You must use display:popup otherwise it will break - known FB bug.
// NOTE: You cannot use a Facebook URL in the link otherwise it will break.
// It needs to be the path that the app sits on.
FB.ui({
       method: 'send',
       name: 'Property Place',
       link: window.location.href,
       picture: '',
       description: 'Check out my selection of properties',
       display: 'popup'
});

There are a couple of other basic Facebook JavaScript functions that are pretty much essential for a decent user experience and are worth mentioning here:

  • FB.Canvas.setAutoGrow() - allows the iframe your application sits within to resize dynamically. If you don't use this function you will get scrollbars around your application.
  • FB.Canvas.scrollTo(0,0) - forces the page to the top of the iframe after every page reload. Without it, after you click a link within the app the page won't reload with you at the top; it will instead stay at the same position relative to the outer Facebook surround.

Open Graph tags

Properly configured Open Graph tags are crucial if you want people's shares of your content to be as "rich" as possible. This is best illustrated with an example:

  • Full "rich" snippet
  • Non-rich snippet

Open Graph tags aren't difficult to implement thanks to the Meta Tags module. With this installed you can configure it so that the Open Graph tags are dynamically rendered based on fields held within a content type.

Social Plugins

The Facebook Social Plugins have come on massively in the last 12 months, meaning that use of the full Facebook JS API can now be reserved for highly customised tasks.

The one gotcha with Facebook Social Plugins in AJAX-heavy apps is that if you return any content containing Facebook plugin markup, you need to tell the Facebook library that you've inserted it into the document, e.g.

If you return:

<fb:like send="true" width="450" show_faces="true"></fb:like>

Then you need to call the function:

FB.XFBML.parse();

Properties within the feed are displayed on the site normally and without any special prominence, but there was a requirement to let users add their own properties manually, for a fee. Users who pay this fee get a tiered suite of options for how their property is featured on the site.

Drupal Commerce Integration

The obvious way of allowing this type of transaction to take place on the site is to use Drupal Commerce with a supported payment gateway.

  • Commerce is flexible - you can do just about anything you can think of.
  • Paying for the node to be published is handled using custom checkout rules.

Facebook Credits

With Property Place being a Facebook application, it made sense that users would also be able to pay for listings on the app using the Facebook Credits currency. Although this functionality isn't quite in place on the application yet, it is planned for a future phase of development.

Hardware & Server Setup

Putting the application on a shared server or budget VPS wasn't really an option; we needed something capable of handling the data volume and the traffic forecasts.

We settled on using Amazon's EC2 offering as it seemed like a fairly cost-effective offering and provided scaling features that we could manage ourselves.


At present the entire application, including the Solr server, runs off a single server instance. Not great practice from the point of view of sysadmin purists, but it works, and it saves the cost of an additional server being set up and maintained.

When the current setup gets to the point where it looks to be struggling to cope, we have a couple of options. One would be to put the site assets onto a CDN, reducing the number of requests the server has to deal with. Another would be to move Apache Solr onto its own server, massively reducing the load on the primary server - given that the entire site is pretty much driven off search requests. This would buy us time until the next upgrade is required, at which point we could either throw more CPU cores or RAM at the primary server (thanks to the flexibility of Amazon EC2) or think about scaling horizontally with load-balanced servers.


One of the requirements of a Facebook application is that the app sits on a server with a valid SSL certificate.


Performance

Within reason, the usual advice is to worry about performance at the end of the project rather than getting distracted by it during the build: build first, refine later. At the end of the project we spent quite a bit of time getting the page load time down for all the key pages on the site.


xhprof

With so many different modules and systems working together, and a less-than-rapid site loading in your browser, you need some way of finding out where the bottlenecks in your code are. Short of embarking on a semi-educated shotgun-debugging mission, it makes sense to employ a tool called xhprof, designed by our friends at Facebook to do just this. Used in conjunction with Devel, xhprof gives you a powerful profiler capable of pinpointing optimisation-ripe areas of your site.

Our process for speeding up the site was simple; load up a page on the site with xhprof and Devel enabled, look for the listed slow functions, attempt to speed them up, and repeat.

Facebook PHP SDK

One of the major bottlenecks that xhprof identified was the Facebook login code - it was adding some 2 to 3 seconds per page request. The solution was simple: only authenticate against Facebook at the start of the session, auto-log the user into Drupal using the Facebook credentials they have just volunteered, and then rely on Drupal sessions to maintain the user's session. Logging out is handled by configuring Drupal to end its session when the browser session ends, achieved by setting the following variable in settings.php:

ini_set('session.cookie_lifetime', 0);


CDN

Although the site doesn't use a CDN at present (due to budget limitations), it is something we will consider should traffic grow to a level where it can be justified.


APC

APC stands for "Alternative PHP Cache". It is an opcode cache that speeds up your site by caching both compiled PHP code and user variables.

Indexing tables

Probably the single most important thing you can do to speed up database queries when dealing with large volumes of data is to make sure your tables have indexes. Drupal does this by default on its core tables, but if you create any database tables outside of Drupal for use with Migrate, adding indexes will massively speed things up.
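
The difference an index makes can be seen directly in the query planner - a small SQLite demonstration of the principle (the project itself used MySQL, and the table here is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE property (property_id INTEGER, price INTEGER)")
conn.executemany("INSERT INTO property VALUES (?, ?)",
                 [(i, i * 1000) for i in range(10_000)])

# Without an index, a lookup by property_id has to scan every row.
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM property WHERE property_id = 42"
).fetchone()[-1]

# With an index, the same lookup becomes a B-tree search.
conn.execute("CREATE INDEX idx_property_id ON property (property_id)")
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM property WHERE property_id = 42"
).fetchone()[-1]

print(before)  # a full table scan
print(after)   # a search using idx_property_id
```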


We chose to base the front-end markup on a theme called Omega. This gave us a lightweight, HTML5-based framework that was easy to work with and modify. It also provided a responsive approach that we may be able to build upon in future development work on the site.

We used AJAX requests where appropriate to keep initial page response time to a minimum and enhance the usability of the application.

Implementation Partner Details

Key modules/theme/distribution used: 
Why these modules/theme/distribution were chosen: 


Community contributions: 


Team members: