Running drush hosting-task outputs weird drush packets, for example:

aegir@marcos:~/hostmaster-HEAD/profiles/hostmaster$ drush @hostmaster hosting-task 296
DRUSH_BACKEND:{"type":"message","message":"Changed group ownership of <code>sites\/\/private\/temp<\/code> to www-data","timestamp":1351798138.4491,"memory":13152528,"error":null,"packet":"log"}DRUSH_BACKEND:{"type":"notice","message":"Exception Object\n(\n    [message:protected] => We were unable to find any installation profiles. Installation profiles tell us what modules to enable and what schema to install in the database. A profile is necessary to continue with the installation process.\n    [string:Exception:private] => \n    [code:protected] => 0\n    [file:protected] => \/home\/aegir\/platforms\/drupal-7.14\/includes\/\n    [line:protected] => 1034\n    [trace:Exception:private] => Array\n        (\n            [0] => Array\n                (\n                    [file] => \/home\/aegir\/platforms\/drupal-7.14\/includes\/\n                    [line] => 473\n                    [function] => install_select_profile\n                    [args] => Array\n                        (\n                            [0] => Array\n                                (\n                                    [parameters] => Array\n                                        (\n                                            [profile] => \n                                            [locale] => \n                                            [q] => \n                                        )\n\n                                    [settings_verified] => 1\n                                    [forms] => Array\n                                        (\n                                            [install_settings_form] => Array\n                                                (\n                                                    [driver] => mysql\n                                                    [database] => test1orangeseeds\n                                                    [username] => test1orangeseeds\n                                                    [password] => XXXXXXXX\n                                                    [host] => localhost\n                                                    [port] => 3306\n                                                    [db_prefix] => \n                                                )\n\n                                            [install_configure_form] => Arr

That's an install task.


Steven Jones’s picture

Yeah, we are still seeing these in the testbot logs, and when testing manually in-fact, looks like we need to do some bug hunting in Drush itself.

Steven Jones’s picture

It's really odd indeed, there are logs like this:

Deleting /var/aegir/platforms/openatrium/modules/translation directory successful. [613.5 sec, 13.86 MB]                                                              [message]
ssage","message":"Deleting \/var\/aegir\/platforms\/openatrium\/modules\/statistics directory successful.","timestamp":1357595302.3709,"memory":14500144,"error":null,"packet":"log"}

Which looks like the start of the packet was lost somehow?

Steven Jones’s picture

Title: hosting-task outputs drush packets » Drush invoke process can't handle large packets of data
Project: Hostmaster (Aegir) » Drush
Version: 6.x-2.x-dev » 7.x-5.x-dev
Component: User interface » Base system (internal API)
Priority: Normal » Major

This is indeed a Drush issue, so moving to the Drush queue.

Here's where the bug is: in _drush_backend_proc_open it checks to see if there's anything to read from the child processes, and then when there is it reads the next 4096 characters from the stream. If the packet of data written to the STDOUT in the child process was bigger than 4096 characters, then that packet basically gets logged to the screen, not parsed by Drush, and then the remaining characters are also read and logged to the screen.

Simple 'solution' would be to increase 4096 to something larger so that larger packets can be read, but I think we'll just run into an issue somewhere else, or the reading will happen in the middle of a large packet being transferred, and so we'll be back in this weird position.

The code needs to be smarter and if it can't parse the backend packets it needs to buffer them up and appending that string to the next chunk of data it gets I think.

greg.1.anderson’s picture

Assigned: Unassigned » greg.1.anderson

drush_backend_parse_packets assumes that every packet fits inside the 4096-byte packet size. The solution suggested at the end of #3 is correct; if a packet contains "\0DRUSH_BACKEND:", but does not contain the trailing null, then we should accumulate more data from the stream and try again. More specifically, since it is also possible that the "DRUSH_BACKEND:" might be truncated, the code should also insure that any \0 that occurs within 14 characters or so of the end of a packet will also cause the output to be re-scanned on the next iteration.

The code is already doing this:

$bucket[$site]['output'] .= $string;

In the existing implementation, packets are removed from $string before it is appended to 'output', and 'output' is never edited. Perhaps it would be better to append $string to 'output' prior to removing packets, and edit them out as they are processed. If we then remember the last offset of 'output' that we scanned for packets, we wouldn't need to re-search the whole thing on every iteration. If there are no nulls in 'output', then the last offset would equal the length of the output buffer at the end of each iteration. If there is an unterminated \0DRUSH_BACKEND, then the last offset will be stuck there until the packet is terminated, and if there is any other null in the output, then the offset will be stuck there until there are at least 14 bytes more in the output.

Will give this a go when I have a chance; in the meantime, patches welcome.

Steven Jones’s picture

Status: Active » Needs review
3.37 KB

Here's a very initial stab at this, fixes the original issue of packets not being parsed but being logged, but shifts the logging of non-packets to the eof for that process. I guess that may on each iteration it could check to see if there's something that looks like a packet in the output (maybe a single null character?) and if there isn't then logs the output as before.

greg.1.anderson’s picture

Status: Needs review » Needs work

That's not a bad start, and it even looks as if you do log packets incrementally (although I did not test to confirm). However, I am still concerned that there may be memory and performance issues with scanning the output on each iteration. Increasing the buffer size could help here, if so. Could you either measure and confirm that this performs similarly to the old code (ideally by first dumping a lot of text data before logging), or adjust to record the last found packet position to reduce scanning?

greg.1.anderson’s picture

Version: 7.x-5.x-dev » 8.x-6.x-dev
Status: Needs work » Needs review
Issue tags: +Release blocker

Here's an alternate implementation that should avoid re-parsing large output chunks and should also maintain the incremental log message output behavior from the existing code. I have not tried with large packets, but existing test cases pass.

If you could provide a unit test that verifies this works with large packets, then I think it's probably ready to go in.

Tagging as 'Release blocker' so that Jonhattan can decide if he wants a backport of this for the pending Drush-5.9 release.

greg.1.anderson’s picture

5.83 KB

Oh, and here is the patch.

greg.1.anderson’s picture

Here is an updated patch with a functional test that confirms that split packets can be re-assembled. This patch adds a new capability to backend invoke that allows the caller to specify a backend option '#process-read-size' that will change the block size that backend invoke packets are processed in, if desired. (If not specified, the previous default of 4096 is still used.) The test case sets this option to a very small value, so that all packets will be split. With the split-packet handling code removed, this test fails. When it is present, it passes.

Please confirm this works with Aegir. If it does, I recommend committing this to Drush-8.x-6.x and also Drush-7.x-5.x for the upcoming Drush-5.9 release.

Steven Jones’s picture

Is the introduction of the null character backportable? Will the backport have to remove the additional of the null character?

greg.1.anderson’s picture

I think that the issue of backporting the addition of the null character is subject to discussion. Note that:

1. If the null is removed from the output stream, the existing code will still run, but will be vulnerable to an intermittent bug: drupal backend packets will be fine, but if the final backend output marker is split across a 4096-byte boundary, then the final function result will be lost, with unpredictable results.

2. The existing code uses 'trim' when processing the backend result, so the null disappears in this function.

3. The existing code does not use 'trim' when processing incremental results, so the null may be printed to the console.

Note that although the null may be printed to the console (to no effect, I believe), it never appears in machine-parsed data structures. Also, even if the null is not backported, existing versions of Drush-5.x will still see the null, to the same effect described above, when talking to Drush-6.x with this patch applied.

I therefore recommend backporting this to 5.x with the addition of the null in the output stream. Note also that Drush 5.9 is scheduled to come out soon (around "the end of this week" == today or in the next day or two), and this is the only remaining bug under consideration for inclusion -- all others have been committed. If this fix is desired in Drush 5.9, please provide prompt feedback so that we can drive this to completion. n.b.: It may still be decided to defer this to Drush 5.10, depending on the opinions of other maintainers.

Steven Jones’s picture

Initial testing looks okay to me, applying this patch and running Aegir's test-suite now to test the original issue.

Steven Jones’s picture

Status: Needs review » Needs work
Issue tags: -Release blocker

Sadly I still see 'DRUSH_BACKEND' in the output with this patch applied, so there must be something subtle going on that's not quite addressed by this patch. I'll try to work out what, but I'd suggest that this issue shouldn't block a release.

anarcat’s picture

Assigned: greg.1.anderson » anarcat

I'll take a look at this sucker shortly if you guys don't mind.

Steven Jones’s picture

Yeah sorry, I've not been able to get to the bottom of this one.

greg.1.anderson’s picture

I haven't had time to return to this since #9; it would be great if you could take a look.

anarcat’s picture

So here's an example of the behavior I am seeing:

Cache HIT cid: 5.8-commandfiles-0-5d22b7f7ed2b58f346cba08552ce3933       [debug]
[0.01 sec, 3.67 MB]

DRUSH_BACKEND:{"type":"bootstrap","message":"Found command: provision-install-backend (commandfile=provision)","timestamp":1359741621.8004,"memory":12609552,"error":null,"packet":"log"Loaded alias @server_localhost from file /var/aegir/.drush/server_localhost.alias.drushrc.php [4.95 sec, 12.06 MB]                                                                                                                     [notice]

Notice the weird spacing and the DRUSH_BACKEND bullcrap in there...

anarcat’s picture

So after looking at this for about an hour just got me completely and utterly confused.

I am not sure the bug I am seeing is related to the 4096 byte limit.

I am also having trouble simply reproducing the issue outside of Aegir - there isn't a lot of ways to reproduce this bug short of writing full unit tests with the output. Hairy. This is what I will probably end up doing. I have done this:

aegir@marcos:/usr/share/drush/commands/provision$ php /usr/share/php/drush/drush.php  --backend=2 --root= --verbose --debug  provision-save ''   --context_type=site --master_url= --db_server='@server_localhost' --platform='@platform_Drupal717' --language=en --profile=minimal --client_name=anarcat 2> /tmp/packets.log > /tmp/output.log

Then I will try to build a test case for drush_backend_parse* functions by feeding them those output files, which seem to be causing problems, or at least irregular output, in Aegir.

While looking at unit tests, I also noticed that they are themselves spewing out a lot of packets to stderr, example:

anarcat@marcos:tests$ phpunit --filter=testBackendFunctionResult .
PHPUnit 3.7.13 by Sebastian Bergmann.

Configuration read from /home/anarcat/src/drush/tests/phpunit.xml.dist

DRUSH_BACKEND:{"type":"bootstrap","message":"Bootstrap to phase 0.","timestamp":1359749156.9175,"memory":8090968,"error":null,"packet":"log"}

DRUSH_BACKEND:{"type":"bootstrap","message":"Found command: php-eval (commandfile=core)","timestamp":1359749156.9253,"memory":8095616,"error":null,"packet":"log"}

DRUSH_BACKEND:{"type":"notice","message":"Command dispatch complete","timestamp":1359749156.9326,"memory":8104064,"error":null,"packet":"log"}

DRUSH_BACKEND:{"type":"memory","message":"Peak memory usage was 8.94 MB","timestamp":1359749156.9328,"memory":8104928,"error":null,"packet":"log"}

Log: Executing: /usr/bin/drush --nocolor php-eval 'return '\''bar'\''' --backend

Time: 0 seconds, Memory: 12.25Mb

OK (1 test, 2 assertions)

This looks fairly harmless, but I think it's actually an error - stderr should be eaten by backend_invoke...

I am fairly confused now, and my test environment here isn't quite up to speed, so I'll try to work more on this next week...

Steven Jones’s picture

To reproduce, I created two drush commands, one took a single argument, an integer, and outputted a string of that length. Then the other command, invoked the first, passing an argument through, so it was quite easy to test outside of Aegir.

I'll try to remember to dig out the code and post it here.

anarcat’s picture

Status: Needs work » Reviewed & tested by the community

Regardless of my above comment, it seems that the patch *DOES* improve the output. I was able to find out what the error was with my wailing task while I couldn't do it before.

The fact that DRUSH_BACKEND packets still show up is a minor inconvenience.. The only concern I have with the current patch is that it complicates code that is already FUBAR'd for me, but it depends whether our priority is to get things working or understandable. :P

Otherwise, let's just commit this, at least to fix the regression in D5.

greg.1.anderson’s picture

Status: Reviewed & tested by the community » Fixed

Went ahead and committed this per #20. If anyone has more to add on any of the observed odd output, please feel free to re-open.

Committed in cc8b76b.

Steven Jones’s picture

Version: 8.x-6.x-dev » 7.x-5.x-dev
Status: Fixed » Patch (to be ported)

I think that this needs backporting to 5.x too?

greg.1.anderson’s picture

Status: Patch (to be ported) » Fixed

Committed on 7.x-5.x branch. Please re-open if any problems are observed.

Steven Jones’s picture

Thanks, we shall test this with Aegir's testing scripts, which seemed to show the issue quite clearly.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

jhedstrom’s picture

There appears to be a bit of a regression in the output for all make commands after this commit. #1982502: Reduce complexity of backend invoke; consider removing concurrency.