Distributed downloads

Last updated on 24 April 2025

Proposed by: Wim Leers

UPDATE:

Originally I intended this as my own project proposal, but, unexpectedly, I was given the opportunity to work on a new startup, which is of course even better than SoC. If somebody would like to continue with this proposal, the SoC mentors may decide who is the best candidate.

Benefits

  • Faster downloads
  • Lower hosting costs (also possible: share bandwidth with "affiliate sites")
  • Lower chance of downtime

Concept

Suppose you have a Drupal-powered site, myfirstsite.com, which provides (fairly large) downloads to its visitors. This server would then be the MS ("Master Server") for all files on that site.
Now suppose you also have some other sites, mysecondsite.com and mythirdsite.com, and you have a friend whose site (siteofafriend.com) has 50 GB of unused monthly bandwidth. This module will allow you to use these sites (mysecondsite.com, mythirdsite.com, siteofafriend.com) as download servers, thus reducing the number of files your own site (myfirstsite.com) must serve. These servers would be the PSes ("Participating Servers").
How do files get synced from the MS to the PSes? And how do you specify that only files larger than 1 MB should be mirrored by the PSes, or that only files not tagged as "confidential" should be mirrored? For that part, we will rely on existing modules: the Publish module and the Subscribe module.
First install this module and the Publish module on the MS. Then install Drupal on the PSes (if it isn't installed already), along with this module and the Subscribe module. On the MS, use the Publish module to create a new "channel" (with filters if you want them) and set the correct permissions. Then subscribe to that channel on the PSes, with the correct authentication credentials. Content will now be synced automatically.
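As a rough illustration of the kind of channel filter described above, here is a minimal Python sketch (the module itself would be written in PHP for Drupal). The names FileInfo, should_mirror, MIN_SIZE_BYTES and EXCLUDED_TAGS are hypothetical and not part of the Publish or Subscribe modules; the 1 MB threshold and the "confidential" tag are taken from the example in the text.

```python
from dataclasses import dataclass, field

# Assumed filter settings, mirroring the example in the text.
MIN_SIZE_BYTES = 1 * 1024 * 1024          # only files larger than 1 MB
EXCLUDED_TAGS = frozenset({"confidential"})


@dataclass
class FileInfo:
    """Hypothetical description of a file hosted on the MS."""
    name: str
    size_bytes: int
    tags: frozenset = field(default_factory=frozenset)


def should_mirror(f: FileInfo) -> bool:
    """Return True if the file passes the (assumed) channel filter."""
    if f.size_bytes <= MIN_SIZE_BYTES:
        return False                       # too small to be worth mirroring
    return not (f.tags & EXCLUDED_TAGS)    # skip files with an excluded tag
```

With these settings, a 5 MB untagged archive would be mirrored, while a 1 KB text file or a 5 MB file tagged "confidential" would stay on the MS only.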

Server selection

When a user starts a download, the best server for that user is selected before the download begins. Criteria are, in order of importance:

  1. user has access to file
  2. server has requested file
  3. server is online
  4. server load is not higher than threshold (slots?)
  5. server has bandwidth left
  6. shortest server-client distance

(Let me know if you think I forgot one.)
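The criteria above could be applied roughly as follows. This is only a sketch in Python under stated assumptions: the Server fields, the LOAD_THRESHOLD value and the idea of a single numeric "distance" per server are all hypothetical, and how load, remaining bandwidth and distance would actually be measured is left open by the proposal. Criteria 1 to 5 are treated as hard requirements; criterion 6 picks among the servers that remain.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

LOAD_THRESHOLD = 0.8  # assumed value; the proposal leaves the threshold open


@dataclass
class Server:
    """Hypothetical snapshot of one server (MS or PS)."""
    name: str
    files: frozenset        # IDs of the files this server holds
    online: bool
    load: float             # current load, 0.0 to 1.0
    bandwidth_left: int     # bytes of bandwidth remaining
    distance: float         # estimated server-client distance


def select_server(servers: Iterable[Server], file_id: str,
                  file_size: int, user_has_access: bool) -> Optional[Server]:
    """Pick the closest server that satisfies criteria 1 to 5, or None."""
    if not user_has_access:                      # criterion 1
        return None
    candidates = [
        s for s in servers
        if file_id in s.files                    # criterion 2
        and s.online                             # criterion 3
        and s.load <= LOAD_THRESHOLD             # criterion 4
        and s.bandwidth_left >= file_size        # criterion 5
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.distance)  # criterion 6
```

If the function returns None for a user who does have access, a natural fallback would be to let the MS serve the file itself, though the proposal does not spell this out.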

The download URL would stay the same (the MS's one). But when the download link is clicked, the criteria above will be checked. If the chosen server is a PS, it will be notified that a certain IP is about to download a file and that this IP has sufficient permissions to do so. The user will then be redirected to a unique URL on the PS, where the download starts automatically. If the MS itself is chosen, it will simply start sending the file.
(Note that this means this system will also be self-load-balancing!)
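One possible way to realise the "unique URL" step is a single-use ticket that the PS stores when the MS notifies it. The Python sketch below is only an illustration: TicketStore, download_url, the /download/ path and the five-minute expiry are all assumptions, and the notification from MS to PS (XML-RPC or otherwise) is not shown.

```python
import secrets
import time

TICKET_TTL_SECONDS = 300  # assumed lifetime; not specified in the proposal


class TicketStore:
    """Hypothetical PS-side registry of pending downloads."""

    def __init__(self):
        self._tickets = {}  # token -> (client_ip, file_id, expires_at)

    def issue(self, client_ip, file_id, now=None):
        """Called when the MS announces that client_ip may fetch file_id."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)  # unguessable URL component
        self._tickets[token] = (client_ip, file_id, now + TICKET_TTL_SECONDS)
        return token

    def redeem(self, token, client_ip, now=None):
        """Return the file ID if the ticket is valid, else None.

        Any attempt consumes the ticket, so each URL works at most once.
        """
        now = time.time() if now is None else now
        entry = self._tickets.pop(token, None)
        if entry is None:
            return None
        ip, file_id, expires_at = entry
        if ip != client_ip or now > expires_at:
            return None
        return file_id


def download_url(ps_host, token):
    """Build the unique URL the MS redirects the user to (assumed layout)."""
    return f"http://{ps_host}/download/{token}"
```

In this sketch the MS would redirect the user to something like download_url("mysecondsite.com", token), and the PS would call redeem with the requesting IP before sending any data.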

Use cases

  • Screencasts.
  • Archives (zip/rar/7-zip/...) of photos, documents, ...
  • Software: applications, Linux distributions, ...

Notes

  • The Publish/Subscribe modules currently do not support syncing of files, so this proposal would include adding that functionality to those modules. They are also not yet Drupal 5 compatible, so it may be necessary to update them as well.
  • Alternatively, it may be better to NOT rely on the Publish/Subscribe modules and build the "mirroring code" directly into the new module. This would also allow us NOT to create new nodes for mirrored files. IMHO this is the more desirable behaviour, but that's open for discussion. It also results in less system overhead for the mirroring sites. And it would allow us to encrypt the XML-RPC calls, but I'm not sure if that's a necessity.
  • Write this module as a FileAPI filesystem?
    It should be decided by the mentors just before the project starts whether the new filesystem/fileapi modules are in a shape to base this project on them. It's definitely the preferred way.

FAQ

Q: How do other projects do this? Sourceforge has a mirror selection routine, and their code is open.... what do they do? Is your proposal aiming to be functionally similar to what I see on Sourceforge?
A: SourceForge uses simple mirror selection, i.e. manual for the client, automated for the servers (probably using rsync). My proposed module would automate it for both the client and the server, and would be "the Drupal way".
