Splitting this off from #397458: Revamp mailing logic to leverage flag module and #34496: [meta] Add Flag module to allow users to subscribe/unsubscribe without posting a comment

Once all the other pieces are in place (which is very close to being done), we need a data migration path: for every user in the system, for all existing content that the user posted or commented on, we add the flag to "follow" that content.

If tracker2 is enabled, this could perhaps be done very quickly by leveraging the tracker2 tables (since those are already recording all the nodes that a user wrote or commented on).

However, not all project_issue sites are using tracker, so we're going to need a real data migration path regardless.


Comments

dww’s picture

Forgot to tag

hunmonk’s picture

Assigned: Unassigned » hunmonk

a couple of observations while researching how to implement this:

  1. this could perhaps be done very quickly by leveraging the tracker2 tables

    tracker2 does index all nodes, even if it's enabled after some content is created, so this approach seems feasible. flag keeps track of both individual flags and a summary of flag counts per content_id. it would be nice to just run two queries to dump all the necessary data into the respective tables, which i think is possible. we will need to be careful about duplicate keys in the case that flag was already being used on a site, so i'm supposing the best thing to do is check {flag_content} to make sure no project_issue nids exist. summarizing, we'd only use the tracker2 shortcut if a) it's enabled, b) it's done indexing all nodes, and c) no project_issue nodes are in {flag_content}

  2. we'd of course want a fallback to just iterate through all issues if tracker2 shortcut isn't feasible on a site.
  3. in either case, we'll need to know the flag to use before we can add the data. flag module provides a way to programmatically create flags; i'm now thinking that this would be a much better approach than manually adding the flag before running the update.
  4. at this point, i'm thinking we'd run an initial update function to perform the necessary tests to see if the tracker2 shortcut can be taken, then store that result in the update session. further updates can test that value and execute/not execute as necessary. (see the sketch below)
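For illustration, here's a rough sketch of what that initial feasibility test might look like. The helper names are hypothetical (in particular, _tracker2_index_is_complete() is a stand-in for however tracker2 actually reports that its index has caught up), so treat this as a sketch of the idea, not code from any patch:

/**
 * Hypothetical sketch of the feasibility test from point 4 above.
 */
function _project_issue_can_use_tracker2_shortcut() {
  // a) tracker2 must be enabled.
  if (!module_exists('tracker2')) {
    return FALSE;
  }
  // b) tracker2 must be done indexing all nodes (placeholder check).
  if (!_tracker2_index_is_complete()) {
    return FALSE;
  }
  // c) no project_issue nodes may already have rows in {flag_content}.
  $existing = db_result(db_query("SELECT COUNT(*) FROM {flag_content} fc
    INNER JOIN {node} n ON n.nid = fc.content_id
    WHERE fc.content_type = 'node' AND n.type = 'project_issue'"));
  return $existing == 0;
}

An initial update function could call this once, store the boolean with variable_set(), and the later updates could branch on that stored value.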
hunmonk’s picture

just realized that because we'll be creating our own flag programmatically, we don't have to worry about checking {flag_content} for duplicates, which makes the test case for using tracker2 that much simpler.

Lars Toomre’s picture

Are flags in the project issue tracking module always going to be defined programmatically? It seems a shame to lose the flexibility to set flag(s) via the administrator UI, and it will slow down implementation when we next need to deploy another flag.

dww’s picture

@Lars: Thanks for caring about this and sharing your thoughts, but I don't really understand your concerns:

- Flags defined programmatically can still be overridden in the flag admin UI.

- project_issue isn't going to hard-code that only its flag can be used for this feature -- we're still going to let you use a different flag if you really want to. We're just going to provide a default that's going to work out of the box.

- If anything, having the flags we need defined in code will speed up deployment, since we don't want d.o deployment processes that look like "enable flag.module, go to the admin UI and click together a flag for this, ...". The more these settings are captured in code, in version control, etc., the more easily and reliably we can deploy them.

Lars Toomre’s picture

Thanks for the clarification. I misunderstood what @hunmonk was referring to. Let me know how I might help get this feature implemented.

hunmonk’s picture

Status: Active » Needs work

i've run a bit aground on the approach outlined in #2:

flag module records the timestamp that something is flagged. unfortunately, using the tracker2 approach the best we have access to is the {node}.changed time of the entire issue, which is certainly not accurate.

a couple of the api functions in flag make use of this timestamp value, and i'm not liking the idea of flying free and easy on this when we actually can provide that value using the project_issue approach (with {node}.created and {project_issue_comments}.timestamp).

i want to get some feedback before i proceed here, because this represents a major change in the approach.

dww’s picture

Status: Needs work » Active

The tracker2 thing was just meant to be a possible d.o optimization. Either way, we definitely need the other case working.

Furthermore, after pondering a bit and chatting with killes, I think we should do this data migration while d.o is still live -- since we're using tracker2 for all the issue queue views and the tracker page, the main things that would be broken without the flag data are e-mail notifications and the state of the "Follow" vs. "Following" UI on the issue pages for each user (all the work at #1284694: Tweak UI of issue following (bluecheese) -- much of which I'm planning to fold into project_issue itself). We can wait to deploy the new follow UI until the data migration is completely done. And we could even leave the old e-mail logic in place to send emails (basically temporarily hard-code it to ignore the follow flag setting and pretend we're still in the no-flag case for those hunks of code in mail.inc -- another good reason to leave flag support optional in there).

Therefore, please carry on getting real data and ignoring tracker2.

Thanks!
-Derek

dww’s picture

FYI:

- I created a new feature branch for flag integration: "34496-flag-integration"
- I finished up #397458: Revamp mailing logic to leverage flag module and merged the old feature branch into that, then deleted the old branch
- I just pushed a commit into the 34496-flag-integration branch for programmatically defining the default flag:
http://drupal.org/commitlog/commit/1894/387b07f0ff0cb830702180ac00ff9485...

Hopefully that'll help while you're working on the data migration stuff.

Thanks,
-Derek

hunmonk’s picture

dww, quicksketch and i had a conversation today about the different dynamics involved in this data migration, which i'll attempt to summarize here. these are the three approaches that seemed at all feasible:

  1. add each flag individually using the flag API:
    the flag API doesn't currently allow timestamps to be specified when flagging a piece of content; it's hard-coded to the time the flag is inserted into the database. this is problematic because we'd like to create a proper flagging history: weird timestamps would screw up any views that show a user's flagging history, and it would be prohibitively difficult to 'play back' the entire issue creation history just to get the timestamp order right. furthermore, when content is flagged, flag module calls hook_flag() to allow other modules to react. if we bypass the flag API to add the flags, the impact of not calling hook_flag() will probably be insignificant on sites that already have flag installed and running, and nil on sites that are installing flag just to use with project_issue. however, we cannot know exactly what any particular site is doing with hook_flag(). i estimate that this approach would take about 6 million queries, would not get the timestamps right, but would ensure that there were no problems with missing calls to hook_flag().
  2. add each flag individually using straight db queries:
    this will allow us to work around the timestamp issue, as we can just use the {node}.created and {project_issue_comments}.timestamp values as we create the flags. it will also be fewer total queries than using the API. i estimate 2.5 million queries, correct timestamps/history, and i *think* we can (if we want) call hook_flag() ourselves to ensure that sites with flag already installed will not have any issues. (see the sketch after this list)
  3. leverage data already stored in tracker2:
    all the mappings for flagged content except the timestamp are stored in a properly indexed {tracker2_user} table. if we don't care about getting the historical order of timestamps right, we can just set them all to the same value after we stuff in the data from tracker2. this will only take 3 (rather expensive) queries, will not give us proper timestamps, and will not offer any support for hook_flag().
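For illustration only, here's a rough sketch of what approach #2 could look like. It uses the stock Drupal 6 {node} and {comments} tables and a placeholder flag name; the real patch would presumably read {project_issue_comments} as described above, so treat every name here as an assumption:

// Hypothetical sketch of approach #2 (straight db queries, real timestamps).
$flag = flag_get_flag('project_issue_follow'); // placeholder flag name
if ($flag) {
  $fid = $flag->fid;

  // Flag every published issue for its (non-anonymous) author, using the
  // node creation time as the flag timestamp.
  db_query("INSERT INTO {flag_content} (fid, content_type, content_id, uid, sid, timestamp)
    SELECT %d, 'node', n.nid, n.uid, 0, n.created
    FROM {node} n WHERE n.type = 'project_issue' AND n.status = 1 AND n.uid > 0", $fid);

  // For commenters, use the first time each user commented on an issue as
  // the flag timestamp, and skip users who already flagged the node above.
  $result = db_query("SELECT c.nid, c.uid, MIN(c.timestamp) AS first_ts
    FROM {comments} c INNER JOIN {node} n ON n.nid = c.nid
    WHERE n.type = 'project_issue' AND c.uid > 0 GROUP BY c.nid, c.uid");
  while ($row = db_fetch_object($result)) {
    $exists = db_result(db_query("SELECT 1 FROM {flag_content}
      WHERE fid = %d AND content_type = 'node' AND content_id = %d AND uid = %d",
      $fid, $row->nid, $row->uid));
    if (!$exists) {
      db_query("INSERT INTO {flag_content} (fid, content_type, content_id, uid, sid, timestamp)
        VALUES (%d, 'node', %d, %d, 0, %d)", $fid, $row->nid, $row->uid, $row->first_ts);
    }
  }

  // Rebuild the per-node flag counts from what was just inserted.
  db_query("DELETE FROM {flag_counts} WHERE fid = %d", $fid);
  db_query("INSERT INTO {flag_counts} (fid, content_type, content_id, count)
    SELECT fid, content_type, content_id, COUNT(*) FROM {flag_content}
    WHERE fid = %d GROUP BY fid, content_type, content_id", $fid);
}

The per-row check-then-insert loop is what drives the query count up; the later comments discuss how much of this can be collapsed or batched.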
dww’s picture

Thanks for the great summary! Feel free to put that in the issue body.

I'm strongly in favor of #2, and I know killes and nnewton will prefer it too. Quicksketch was also in agreement.

Cheers,
-Derek

hunmonk’s picture

Status: Active » Needs review
FileSize
3.49 KB

attached is the data migration path using the "add each flag individually using straight db queries" approach discussed above.

fairly well tested, seems to handle all normal use cases. the update does nothing unless flag is installed and a follow flag is set. it only adds flags for each unique user of an issue, and ignores anonymous user posts (doesn't matter on d.o, but might on other sites). flag counts are updated as well, and all proper timestamps are being inserted into the db.

i did not mess with also calling hook_flag() manually, as flag module passes an instantiated flag object as an argument for the hook, and that's a bit of a black box to me. if quicksketch comes along and helps with that, then we might be able to throw that in there as well...

dww’s picture

I haven't tested, but a quick visual inspection makes me wonder:

A) Can't $flag->type be an array of node types? If we're hard-coding 'project_issue' in the query against {node} why not just use $node->type or continue to hard-code 'project_issue'?

B) Why separate queries to insert each node flag? Why not a single query to do all the issue flags in one pass? Something like this:
db_query("INSERT INTO {flag_content} (fid, content_type, content_id, uid, sid, timestamp) SELECT %d, type, nid, uid, 0, created FROM {node} WHERE type = 'project_issue' AND status = 1", $follow_flag);
?

Sure, we still have to handle the counts separately, but this seems like it's going to save a ton of separate queries.

C) For that matter, why not the same with comments? Can't we do something similar to the above? I guess the big problem is about avoiding duplicates? Can't we let the DB sort that out instead of doing it in PHP? Maybe it doesn't matter.

D) What about new content created between when this batch update is initialized and when it completes (assuming we're not going to take d.o offline while this runs)? Aren't we going to miss those? Or is the idea that we deploy the new code that's auto-following new content so that new flags might be getting inserted while this is all happening?

Thanks,
-Derek

hunmonk’s picture

A) it's not the node type, it's flag's version of the 'content type', which looks to just be 'node'.

B) it needs to take into account not creating flags for issues created by an anonymous user, but on quick inspection it looks like that idea would probably work. we can pick up the flag counts properly when we do the comments.

C) as i've noted in the code comments, we need to deal with not inserting duplicates, as well as properly using the first time somebody commented on an issue as the time they flagged it. i know of no bulk query that will do that.

D) i'm opposed to the idea of running this update while the site is live, and this is a perfect example of why.

dww’s picture

A) Duly noted, thanks.

B) Yup.

C) Okay, fair enough. So we're going to have a bazillion queries for this update. At least it's not 5 or 10 bazillion. ;)

D) I think it's not going to fly to take d.o offline for potentially a few hours, prohibiting 1000s of people from getting things done, just so we don't have to handle this race condition ourselves by writing this update in a way that can handle some duplicates. That's going to be a hard sell to the rest of the infra team, and for good reason. I believe #8 spells out a pretty sane plan for deploying that will a) keep the site live the whole time and b) prevent anyone from seeing weirdness while things are in an inconsistent state. The only trick is to make sure that we actually migrate *all* the data.

One way to do that is to basically deploy the tracker2 change first so that we start populating the flags with new content, *then* go back and migrate everything -- we just can't assume no duplicates. It's not even obvious to me (I haven't investigated closely) why we need to worry about duplicates ourselves -- can't we just let the DB prevent those? {flag_content} already has a unique key defined on array('fid', 'content_id', 'uid', 'sid') which seems to be exactly what we want. I guess that'd be a reason not to try to do all the node flags in a single query -- if we hit a duplicate key error, better for a single flag insert to fail than the entire query.

That said, the stakes are not *that* high if we miss a few. Honestly, I'd much rather announce "On Day X while this is being deployed, if you create new issues or comment on existing ones, you might not automatically be following them and you should check the fancy new button to make sure you're following everything you care about." That'd be much better than saying "On Day X, no one can use drupal.org at all for N hours, period." Of course, I'd rather we just got this working in a way that can handle new content as it's coming in so we don't have to announce either. ;)

Thanks,
Derek

hunmonk’s picture

D) what about making nodes (and editing issue notifications) read-only during the update? a bastardized version of http://drupal.org/project/readonlymode could accomplish that pretty easily. that would prevent any confusion about missing flags and email notifications, and provide people access to all the current content of the site during the entire upgrade. seems like a good middle ground to me.

if we do that, then the single query from B) is viable, otherwise we'd really want to use individual queries and allow them to fail on duplicate.

also, it just occurred to me that we haven't discussed a flag migration path for existing content in tracker2, only enabling tracker to flag newly tracked content going forward. is the party line "flag old issues from the main tracker2 page if you want them flagged, newly tracked issues will be auto-flagged"?

dww’s picture

Re: readonly -- that's still a pretty big burden to put on the Drupal community just to handle a relatively minor edge case. But, it's certainly better than taking the site offline. Either way, I'm definitely okay disabling changes to issue notification settings -- that's not a big deal at all.

Another thought I had was something more like this for the deployment order:

  1. Enable flag.
  2. Deploy the project_issue code, but disable the auto-flagging, hide all of the new UI (and disable editing per-project issue notification settings), and tell the e-mail notification code to keep using node/comment authorship to decide who gets notified. In case it's not obvious, I mean deploy the new code with all that stuff pre-disabled. ;)
  3. Run the main update (could be hours). We can use the single query for flagging all nodes since there's no chance of duplicates. At the end of the update, have it store somewhere (e.g. a variable_set()) the last nid and cid it processed (see the sketch below).
  4. As atomically as possible, record the max cid and nid on d.o and then turn on auto-flagging in project_issue.
  5. Run a second update that just looks at the nids and cids between the ones recorded at the end of the main update and the ones we just recorded to handle everything that was created while the main update was running.
  6. Turn on the new UI, allow users to manage per-project issue notification settings again, and tell project_issue to start sending notifications based on flag, not node/comment authorship.

In theory, the only race condition here is during step 4, between the instant we record the current max nid/cid and the instant we turn on auto-flagging, which should be a very small fraction of a second, and I'm definitely not worried about that. The main update in step 3 can still be written "naively" to assume {flag_content} is empty (at least insofar as our follow flag is concerned) and doesn't have to concern itself with handling duplicates. In fact, the 2nd update doesn't need to worry about duplicates either, since we're going to have the range of nids/cids to handle (although it probably wouldn't hurt to be a little more resilient to duplicates in that one).

Other than slightly more work for us to get all this happening properly, I don't see any problems with this approach. Do you? d.o stays up (and read/write) the whole time. There's no visible weirdness or inconsistent states. It should migrate all the data, without making things overly complicated when writing these upgrade paths. Thoughts?
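For concreteness, a rough sketch of the bookkeeping for steps 3-5. The variable names are placeholders and none of this is from the actual patch:

// Step 3: at the end of the main batch update, remember where we stopped
// ($last_nid_processed / $last_cid_processed come from the batch loop).
variable_set('project_issue_migrate_last_nid', $last_nid_processed);
variable_set('project_issue_migrate_last_cid', $last_cid_processed);

// Step 4: record the current maximums and turn on auto-flagging, as close
// together as possible.
variable_set('project_issue_migrate_max_nid', db_result(db_query("SELECT MAX(nid) FROM {node}")));
variable_set('project_issue_migrate_max_cid', db_result(db_query("SELECT MAX(cid) FROM {comments}")));
variable_set('project_issue_follow_flag', 'project_issue_follow'); // enables auto-flagging

// Step 5: the second update only touches the gap created while step 3 ran.
$result = db_query("SELECT nid FROM {node} WHERE type = 'project_issue' AND nid > %d AND nid <= %d",
  variable_get('project_issue_migrate_last_nid', 0),
  variable_get('project_issue_migrate_max_nid', 0));
// ...flag each of these (and the corresponding cid range) exactly as in the
// main update.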

---

Re: existing content in tracker2 -- I don't understand your question. All the existing content in tracker2 is also in {node} and {project_issue_comments}, which is what we're worrying about here. Following forum posts is out of scope for the time being (although should be relatively easy to do in a follow-up issue). What are you worried about in tracker2 that's not already covered?

Thanks,
-Derek

webchick’s picture

Bear in mind that when drupal.org is down, the entire community grinds to a halt. Progress can't be made on issues, code releases can't be made, support questions can't be answered, documentation can't be edited/written, etc.

In the case of the Git migration, where a race condition meant we might irreversibly lose critical data, it made sense for a multi-hour downtime, because the trade-off vs. risk mitigation balance was worth it.

Here, I'm a lot less convinced. Documentation, support forums, releases, commits, and several other key components of d.o could easily carry on during this migration without subscriptions being affected. And what data we do lose here is going to be very easy to get back with the click of a link.

Given that, I'm inclined to put up an announcement alerting people that the migration's going on (maybe using that banner code we have on dev sites?) and just do the migration while things are still happening. We should probably test these kinds of race conditions on a staging server first, though.

nnewton’s picture

dww and I just discussed this in IRC. A summary:

* The plan in #17 looks fine and I agree that this doesn't need a downtime and shouldn't require one. If we miss a few nodes in the race condition window, I'm sure we can track someone down to play a tiny violin.

* The query in #13B actually will not work, because of how replication locking works. When you do an INSERT INTO SELECT like that, it locks the selected rows from the read query as if they were in a write query (since they are). The upshot is that it will write-lock the entire node table. Obviously we don't want this. The original patch with blocks of nodes selected and inserted looks fine; I'd suggest larger blocks than 100 (a rough sketch follows below).

-N
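To make the block-based approach concrete, here's a rough sketch of how the main pass over {node} might be structured. The function name, variable names and block size are assumptions for illustration, not the code in the patch:

/**
 * Hypothetical sketch: migrate issue-author flags in blocks of nids, so we
 * never hold a long INSERT ... SELECT lock on {node}.
 */
function project_issue_update_6014(&$sandbox) {
  $ret = array();
  $flag = flag_get_flag(variable_get('project_issue_follow_flag_for_update', 'project_issue_follow'));
  if (!$flag) {
    $sandbox['#finished'] = 1;
    return $ret;
  }
  if (!isset($sandbox['last_nid'])) {
    $sandbox['last_nid'] = 0;
    $sandbox['progress'] = 0;
    $sandbox['total'] = db_result(db_query("SELECT COUNT(*) FROM {node} WHERE type = 'project_issue' AND status = 1"));
  }
  // Process one block of issues per pass; larger than 100 as suggested.
  $result = db_query_range("SELECT nid, uid, created FROM {node}
    WHERE type = 'project_issue' AND status = 1 AND nid > %d ORDER BY nid",
    $sandbox['last_nid'], 0, 1000);
  $rows = 0;
  while ($node = db_fetch_object($result)) {
    if ($node->uid > 0) {
      db_query("INSERT INTO {flag_content} (fid, content_type, content_id, uid, sid, timestamp)
        VALUES (%d, 'node', %d, %d, 0, %d)", $flag->fid, $node->nid, $node->uid, $node->created);
    }
    $sandbox['last_nid'] = $node->nid;
    $sandbox['progress']++;
    $rows++;
  }
  // Done once a pass comes back empty; otherwise report progress.
  $sandbox['#finished'] = $rows ? min(0.99, $sandbox['progress'] / max(1, $sandbox['total'])) : 1;
  return $ret;
}

The comment pass would follow the same pattern, keyed on cid instead of nid.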

moshe weitzman’s picture

You guys have done the work, so all is well. But I'd be perfectly fine with telling people about the new system and asking them to flag any old issues they care about. That is, no data migration at all.

dww’s picture

Status: Needs review » Needs work

Excellent, very excited to be converging on a viable plan here. Setting this back to needs work for the following:

- Increase the batch size from 100 to 1000
- Add something to the end of that update to variable_set() the last nid and cid processed in the batches
- Write something to handle the 2nd update pass that's restricted to the range of nids/cids as per #17
- Put in necessary code to make it easy to do the various UI/feature toggling I proposed in #17 (e.g. some checks for killswitch variables or something)
- Write something to do the "atomic" queries and variable set from #17 step 4

I think that's it. I'll coordinate with hunmonk offline to sort out who's doing what. ;)

Wondering out loud if there's a sane way to completely automate all of these steps... I suppose if all the various killswitches are in place, the main update could be project_issue_update_6012 (as it is in the patch), the "atomic" toggle could be project_issue_update_6013 (not a batch), the 2nd pass on the restricted range could be project_issue_update_6014 (batching again just to be safe), and then the final cleanup (e.g. removing the now unneeded nid/cid variables, enabling the UI, etc) could be project_issue_update_6015. Does that seem reasonable?

FYI: #1292746: Refresh staging.d.o with live d.o data for testing flag integration and issue following

p.s. @moshe: That's not going to fly. Way too confusing and lots of people would be completely miserable. Do you honestly expect webchick to manually follow every issue in her tracker page?

Lars Toomre’s picture

Sounds like the plan in #21 really needs a test run in a test environment before being attempted on the live site. Has this been done, to get a sense of how long it will take to migrate all of the flag data?

nnewton’s picture

@Lars, thanks for the feedback! We definitely plan to set up a test environment for this, to run both the migration and a user test afterwards. I'll make sure to email you to help with testing.

-N

dww’s picture

Totally untested, but this is almost everything we need. There are a few TODO comments remaining. It's late and I need to sleep, but I wanted to post what I've got so far in case anyone wants to review. Be sure to pull the latest on the 34496-flag-integration branch before applying this.

Some reasons this is still at "needs work":

A) Instead of littering the code with killswitch hacks, we just want to *not* have 'project_issue_follow_flag' defined until after these updates are done. That simplifies things considerably. You'll notice that project_issue_get_follow_flag() is defaulting to FALSE when it does the variable_get(). So, there's only a single hack in the module code that knows about the updates: the tiny code block in project_issue_flag_issue() (which is well-commented). However, I'm not sure if variable_get('project_issue_follow_flag_for_update', FALSE); should default to FALSE like that, too, or if *that* should default to 'project_issue_follow'. I can argue both sides. I left it as FALSE since I was thinking it'd be safe to have the admin define this in settings.php or via a drush vset before running the update. Then again, if we let that default to our provided flag, it'll all Just Work(tm).

B) The original patch in #12 would have printed a query message about every. single. node. it. migrated. That'd be fun on a site like d.o. ;) Fixed that, although now the main update (6014) doesn't print anything so that's still a TODO.

C) We still need to process all the comments added on the live site during the first pass that aren't attached to nodes created in that pass. I've got the query all set to find them, but I punted on actually flagging them. These are the ones that are most likely to result in duplicate flags, and where we're going to just have to recompute the flag counts.

D) I now realize we should query MAX(cid) FROM {comments} during the initialization for update_6014() and restrict the comment queries to always ignore comments higher than that value during update_6014(). That's the only way to ensure that new comments coming in while update_6014() is running are all handled during update_6016(). Otherwise, there are race conditions that could lead to missing comments (e.g. if a new comment is posted on an old node that we've already processed, it won't be reflected in the max cid we're recording as we go, but it'll be outside what we query during update_6015()). (See the sketch below this list.)

E) I haven't made update_6016() a batching update yet. I'm not 100% sure we need to. Even if update_6014() takes a few hours to run, there are probably only going to be O(10s) of issues created in that period, and maybe O(100s) of comments. Maybe we should just make it batching anyway, just to be safe, but maybe we can avoid the complication.
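A tiny sketch of the idea in D (variable names are placeholders, not the actual patch):

// At the start of update_6014(): freeze the comment range this pass covers.
if (!isset($sandbox['max_cid'])) {
  $sandbox['max_cid'] = (int) db_result(db_query("SELECT MAX(cid) FROM {comments}"));
  variable_set('project_issue_update_6014_max_cid', $sandbox['max_cid']);
}

// Every comment query inside update_6014() then ignores anything newer:
$result = db_query("SELECT cid, nid, uid, timestamp FROM {comments}
  WHERE nid = %d AND uid > 0 AND cid <= %d", $nid, $sandbox['max_cid']);

// ...and update_6016() only needs to look at cids above the stored value:
$start_cid = variable_get('project_issue_update_6014_max_cid', 0);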

dww’s picture

Still untested, but this fixes everything from #24 plus a few things hunmonk asked about when we chatted about this patch.

I'm going to do an initial test of this over on http://subscribe-drupal.redesign.devdrupal.org to see how we're doing. ;)

dww’s picture

Heh, first attempt at running this was spewing a bazillion errors like so:

Unknown column 'last_updated' in 'field list'  [warning]
query: /* 127.0.0.1 home */ INSERT INTO flag_counts (fid, content_type, content_id, count, last_updated) VALUES (3, 'node', 1165494, 2, 1306107439) database.mysql.inc:150

That's from the original query from patch #12 to update {flag_counts}.

Let's try again... ;)

dww’s picture

dww’s picture

First real attempt was mostly a success. However, we had a few problems:

- Got some duplicate keys during update_6016(), since we had live comments after 6015() ran that had already flagged one of the new nodes created during 6014(). So, we now clear any existing flag records in 6016() before we call _project_issue_update_follow_flag_process_node() -- we also now invoke that helper in a way that doesn't try to enforce a max cid and just queries for all comments on the node.

- I had reversed the order of singular vs. plural in format_plural() so some of the output messages were bogus. There were also a few other bugs in the messages (e.g. %value vs @value).

- In 6016() when processing the comments on older issues that were posted during 6014, I was using $fid not $flag->fid so the record went into {flag_content} with fid 0.

I also temporarily added a few debugging messages in here just to make sure we get enough data on this next attempt.

dww’s picture

Sweet, that almost perfectly worked! ;) Actually, the migration itself worked flawlessly. No duplicate key errors. Everything was flagged as expected. webchick and I both pounded the site while the upgrade ran, so we hit all the expected edge cases:

NOT flagging node 1243552 for user 46549 because of comment 4841371 (already flagged)  [success]
NOT flagging node 1236280 for user 24967 because of comment 4841372 (already flagged)  [success]
Flagging node 1243552 for user 60783 because of comment 4841373  [success]
Flagging node 1238484 for user 24967 because of comment 4841375  [success]
Flagging node 1171828 for user 46549 because of comment 4841377  [success]
Flagging node 1164492 for user 24967 because of comment 4841378  [success]
NOT flagging node 1166476 for user 46549 because of comment 4841379 (already flagged)  [success]
NOT flagging node 1238484 for user 24967 because of comment 4841380 (already flagged)  [success]
NOT flagging node 1164700 for user 24967 because of comment 4841384 (already flagged)  [success]
NOT flagging node 1172214 for user 46549 because of comment 4841387 (already flagged)  [success]
Added follow flags for 4 comments which were posted on existing issues while update 6014 was running.  [success]

There's only one tiny problem in how one of the messages was printed, but it was a silly bug on my part in the printing, not a bug in the migration.

SO... I'm tentatively calling this a complete success. However, I'd really like to try it on the full d.o dataset on beta.d.o. If only #1292746: Refresh staging.d.o with live d.o data for testing flag integration and issue following were done...

Anyone want to look through the code to see if they spot anything funny?

But, I believe this is RTBC now. HURRAY!

hunmonk’s picture

code looks good except for one oversight i think:

{flag_counts} (at least in the 6.x-2.x version i'm running) has a 'last_updated' field that's not addressed in the current patch. that field records the current unix timestamp when the count total is updated. since the flag API adds the flag then immediately updates the count, and since we're building historical data, we should use the timestamp from the created flag to update this value. we're already collecting that and using it for {flag_content} -- this change adds support so it also is used in {flag_counts}.

untested.

dww’s picture

Re: #30: Thanks for the review. However, that last_updated field was the source of errors in my testing -- see #26. This column does not exist as of flag 6.x-2.0-beta6, which is what I'm testing against and targeting for deployment. Therefore, I'm planning to go with patch #29.

dww’s picture

Status: Needs review » Fixed

Although we're still awaiting final testing on a copy of the live d.o database, I've committed and pushed patch #29 to the 34496-flag-integration branch. Calling this fixed for now. We can always re-open if we uncover problems in further testing (which is going to happen once #1292746: Refresh staging.d.o with live d.o data for testing flag integration and issue following is done).

p.s. quicksketch confirmed in IRC that flag 6.x-2.0-beta7 is *not* just around the corner, which is why I'm sticking with patch 29...

webchick’s picture

bomp bomp bomp. another one bites the dust-AH!

YAY!

naught101’s picture

Can I just say: Holy shit, you guys are awesome!

Automatically closed -- issue fixed for 2 weeks with no activity.