I'm trying to figure out the best way to handle the derivation process at archive.org with the engine API. The process looks like:
1. Send source file to archive.org
2. Wait some undefined period of time (minutes-days)
3. Once derivatives are available at archive.org, store them as derivatives locally
I'm running into a challenge with step 1 -- hook_media_derivatives_create_derivative expects the actual derivative file to be returned, but in this case that file isn't available for quite some time. It would be nice to kick off the file transfer and encoding process now and store the derivatives later, when they become available.
I could bypass the entire hook & configuration and use the derivatives_api structure to store derivatives, but I really like the configuration process you're building and would love not to re-create a bunch of it.
Any suggestions would be great. Thanks for all your work on this!
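To make the mismatch concrete, here is a minimal sketch of the contract described above. Only the hook name comes from this thread; the parameter list and the ia_engine_* helper names are assumptions for illustration.

```php
<?php
/**
 * Hypothetical engine implementing the hook described above. The hook
 * must return the finished derivative file, which is the problem:
 * archive.org may need minutes to days to produce it.
 */
function ia_engine_media_derivatives_create_derivative($source_file, $preset) {
  // Step 1: send the source file to archive.org (hypothetical helper).
  ia_engine_upload($source_file);

  // Steps 2-3: the derivative does not exist yet, but the hook has
  // nothing sensible to return other than the finished file, so a
  // naive implementation would have to block for minutes to days.
  return ia_engine_wait_for_derivative($source_file);
}
```

The blocking wait at the end is exactly what the request below asks to avoid.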
Comment | File | Size | Author |
---|---|---|---|
#6 | 1289710_non_blocking_processing_of_derivatives_6.patch | 8.04 KB | slashrsm |
Comments
Comment #1
civicpixel CreditAttribution: civicpixel commented
The path I'm currently taking is:
1. Handle initial file transfers of source material to archive.org outside of media_derivatives
2. Define a media_derivatives event, "new derivation available at archive.org", that fires as new files related to the source file become available at archive.org
3. Scheduled => Immediate
4. Conditions => matches filetype xyz (ex: mp4)
I think this will work out OK, but I'm still not entirely clear on where I'm going to store the file transfer settings. Some of them are defined by the engine settings, and others on a global settings page.
Comment #2
slashrsm CreditAttribution: slashrsm commented
That sounds like a tougher one... :)
My first idea was to change the API for the 'create derivative' callback a bit, so it would be legal to return something like: 'I'm still processing this derivative; please keep it in the processing state, since I'll do this in the background. I'll inform you when it's done.'
We would then implement some function that you would call when a derivative is actually created, which would save it and change its state to 'finished', just like media_derivatives_start_encode() does now.
I think there will be other use cases where we'll need this kind of 'non-blocking' behaviour, so I'd like to find a solid solution for this. Do you think this is one, or is it mostly a hack?
Comment #3
civicpixel CreditAttribution: civicpixel commented
I think that's a good solution -- I planned out what the workflow would then look like for the Internet Archive:
PRESET: IA Transfer
PRESET: IA Derivative
This would be ideal because I could still store the majority of the derivative information & status using media_derivatives.
Comment #4
slashrsm CreditAttribution: slashrsm commented
I'll try to code this next week.
Comment #5
civicpixel CreditAttribution: civicpixel commented
Thanks for the update -- I have the engine working in a very basic state at the moment. If you can accommodate the above, I should be able to get a development release up. I have another issue, but I'll post it in a new thread.
Comment #6
slashrsm CreditAttribution: slashrsm commented
The attached patch implements it. Please test it and let me know if it works as expected.
The engine can now return MEDIA_DERIVATIVE_ENGINE_PROCESSING. This will leave the derivative in the processing state and wait for the engine to finish its job.
When you have your derivative, call media_derivatives_derivative_finished() and pass three arguments:
- the derivative object or MDID
- the derivative file object or a URI string (this is what the engine returned before this patch)
- (optional) the source file object
If an error happens, call media_derivatives_derivative_error() and pass two arguments:
- the derivative object or MDID
- an instance of MediaDerivativesException
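The flow described above might look like this for the archive.org engine. The constant and the two media_derivatives_* functions come from this patch; the ia_engine_* helpers, the status object shape, and the exception message are assumptions for illustration.

```php
<?php
/**
 * Non-blocking engine callback: kick off the transfer and return
 * immediately, leaving the derivative in the processing state.
 */
function ia_engine_media_derivatives_create_derivative($source_file, $preset) {
  // Start the upload to archive.org (hypothetical helper).
  ia_engine_upload($source_file);
  // Tell Media Derivatives we will finish the job in the background.
  return MEDIA_DERIVATIVE_ENGINE_PROCESSING;
}

/**
 * Poll archive.org later (e.g. from hook_cron()) and close out the
 * derivative once its state is known.
 */
function ia_engine_poll($mdid) {
  // Hypothetical helper returning a status object for this MDID.
  $status = ia_engine_check_status($mdid);
  if ($status->finished) {
    // Second argument may be a file object or a URI string.
    media_derivatives_derivative_finished($mdid, $status->uri);
  }
  elseif ($status->failed) {
    // Constructor arguments are assumed here; check the patch for the
    // actual MediaDerivativesException signature.
    media_derivatives_derivative_error($mdid, new MediaDerivativesException('archive.org derivation failed'));
  }
}
```

This keeps all derivative bookkeeping inside media_derivatives while the slow archive.org work happens out of band.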
Comment #7
slashrsm CreditAttribution: slashrsm commented
Comment #8
camdarley CreditAttribution: camdarley commented
Maybe Media Derivatives should include batch operations in the module core, as most of the engines we could use take a large amount of time to process (transcoding, transferring, etc.). We could also have a UI listing all current derivative processes.
I was thinking about it for the engine I'm coding, but as it seems to be useful for almost all other engines, it shouldn't need to be re-coded in each one.
I'm not experienced enough to propose working code using the queue and batch operations, but civicpixel seems to be doing good work on that.
What do you think about this?
Comment #9
slashrsm CreditAttribution: slashrsm commented
I agree. I was planning to support batch jobs from the beginning. One of the reasons I have not developed it yet is my lack of experience with the Batch API. I am not completely sure how this should work. Maybe we can plan this together?
I never thought about a central list of jobs, but it sounds like a good idea! I am not completely sure if this can be done. Anyone?
Comment #10
camdarley CreditAttribution: camdarley commented
As it's no longer related to this issue, I created a new one: #1323430: Per-preset parallelized engine processing
Comment #11
slashrsm CreditAttribution: slashrsm commented
@civicpixel: Have you tried the attached patch? Does it work for you?
Comment #12
slashrsm CreditAttribution: slashrsm commented
This patch could become deprecated if we implement this: #1323430: Per-preset parallelized engine processing.
@civicpixel: What do you think about this?