As we've talked about quite a bit in the past couple months, git is internally capable of giving contribution credit where its due. Necessarily, however, that is going to change the way we think about commit statistics. And that's after the count changes resulting from commit atomicity...but let me explain that first to ensure everyone reading this is on the same page with that.

Right now, a 'commit' on the d.o CVS system corresponds to a change in a single file, not an actual commit action. So if you make changes to 20 files and then commit all those changes at once, it will actually show up as 20 commits on the project/your user listing. This is because CVS commits are non-atomic. Since git commits are atomic (along with basically every other vcs), there's going to be a change that drastically reduces the number of commits everyone on d.o has, simply because we're counting atomic commits instead of number of files changed. So that's the first change.

With the capacity to track contributor commits, though, there's more thinking to do. If we want to give any recognition to contributor participation (which I do), then I think the best approach is to start delineating between 'contributor commits' and 'maintainer commits'. Respectively, those would be commits performed by contributors to an issue repo and later merged into a project repo (but only counted once that merge occurs), versus commits done by the maintainer her/himself. We might even want to split maintainer commits further, to delineate between issue merges and direct author work. Starts getting complicated, I know, but the data's there, and it's at least worth having a conversation about.

The decision we make on this front also has to dovetail with the decision we make on the recommended process for maintainers merging in work done in issue repos. On that front, we have two competing goals:

  1. Allow contributors to really duke it out in an issue, making commits all over the place without worrying about anything except getting to the right code
  2. Allow mainline/project branches to preserve clean histories

Ordinarily this is easy enough to resolve: either it's the responsibility of the contributor to create a clean, outward-facing branch to pull from (basically the kernel model) or it's the responsibility of the maintainer to pull in and do that maintenance. I think the latter is probably more appropriate for our case; as long as the contributor(s) ensure the merge is clean & good, all the maintainer has to do is a squash merge/rebase. That presents us with a problem, though: squashing is not a history-friendly act, as it doesn't actually pull the commit objects over, only the final state of the content. This is unfortunate, because it means we don't have the actual commits made by the contributors, so we can't count them (also because those commit objects won't actually exist in the project repo, future merges that incorporate only some of those changes might get...a little complicated). This is somewhat overcome-able because squash merges default to populating their commit message with all the commits that are being squashed, so we could parse that for commits to be credited. But in the interest of keeping the project branches clean, we'd either probably need to require squash merges, or require the #nid (+branch name?) in the commit message to trigger proper crediting.

This technically doesn't need to be fully resolved until phase 3, but I'm marking it phase 2 because I think the decision may bear on our thinking even during the initial conversion. 'cuz it's definitely bearing on mine.

Comments

Crell’s picture

First question: Define "clean"?

As a module maintainer, what impact does it have on me if when I merge from issue 12345 into the master branch of my module in the "real" repo I get one commit object, 12 commit objects with a clean progression from one to the next, or 80 commit objects that bounce around a lot before arriving at a good conclusion? (Really, I have no idea what difference it makes; that's why I'm asking. :-) )

I suspect the process of turning 80 crazy commits into 12 pretty commits is sufficiently non-trivial that 98% of module maintainers and 99% of other contributors will neither know how nor care enough to actually do so, so we're going to end up with a lot more of those 80-crazy-commit merges.

theunraveler’s picture

I agree with @Crell. As a module maintainer who has a somewhat tenuous grasp on git, if the "process of turning 80 crazy commits into 12 pretty commits" is not "trivial", I (and others) will not do it very readily. If all it requires is a "squash merge", then you have my blessing :-)

Also, I don't really see a problem with the first scenario. As long as those 80 dirty commits eventually make their way into one nice release, I think the effect will be the same. Plus it will be easier to tell who is contributing where, and not just whose commits get merged.

Crell’s picture

theunraveler: I think the issue is that if it's as "simple" as "squash merge" to convert 80 commits into one commit object and then merge that into the master branch, you lose track of who was responsible for those 80 commits, which could be a dozen people.

On the other hand, if we keep those individual commits then the resulting commit log will be all kindsa weird, and include mention of people who threw in crap that got quickly removed.

I think we're in agreement, though, that in most cases "go through by hand and tidy up those 80 commits into a pretty history" is just not going to happen, regardless of who is nominally responsible for it.

'course, I don't know how often the issue of 80 commit object "patches" is going to come up in practice. Certainly far far less in contrib than for core.

bojanz’s picture

I'm not sure there's a need for squash merge, at least for contrib.

Having history of an issue is a good thing.
If I (as a maintainer) found an issue, and needed to code a fix, there's a big chance that I would do it in more than one commit, so if it's not a big problem then, why should it be for project issues?

Plus, I believe most issues in contrib will have less than 10 commits (if not even less).
If some of those commits are trivial/irrelevant, a quick interactive rebase can fix that (not sure if rewriting history would be a problem in this case...).

Crell’s picture

bojanz: So are you suggesting the 80-wandering-commits approach, on the assumption that it won't actually be a problem for the majority case? You're probably right, I suppose. :-) Core is where we're more likely to have competing approaches and other weirdness in the queue.

bojanz’s picture

Yes, that's what I'm saying.
The key point is that there's no need to kill the history of an issue that was merged. It's just as relevant as the main module history.

If there were irrelevant commits ("forgot to change this", "adding a comment to the new code", etc), they can be cleaned up before the merge, further reducing the number of total commits.
Otherwise, keeping the "crazy" commits might even be beneficial (referencing the competing approaches you mentioned, etc...)

Can't say anything about Core, you and the others who made the "mega-patches" such as dbtng and field api probably know what would work best in such cases.

Crell’s picture

sdboyer: If a particular issue has its own branch, is there a way for the owner of that branch (whatever that means) to selectively squash/prune/revisionist-history that branch *easily* before setting it RTBC (or RTBM, as the case may be)? Easily being the operative term there.

It sounds like we may be leaning toward "just merge it as is, warts and all, and if you're a rare case that needs to clean up first then the maintainer and issue owner work it out before merging".

avpaderno’s picture

If a particular issue has its own branch, is there a way for the owner of that branch (whatever that means)

It has been proposed that each branch of a project has its own ACL. In example, for Drupal this would mean that Gábor Hojtsy can commit code to Drupal 6 only, drumm can only commit code for Drupal 5.

Crell’s picture

Right, I mean if we have a branch/repo/whatever associated with this issue, I don't know who "owns" it. sdboyer, since he created it? Whoever has it marked as "Assigned" to them? Owner of the project that it's associated to?

I have no idea. I'm just babbling at this point. :-)

cweagans’s picture

Having the history of an issue could be useful. Let's use Bartik as an example: there were a whole bunch of commits that happened when this theme was being developed. Because all of the intermediate commit history was lost, there's no way to see what issues came up during development of the theme and, by extension, no way to see how those issues were fixed (or if they were).

For a new user coming into Drupal, it could be a cool research tool: "hmm...let's see....I'm developing a module similar to Blog API. I wonder if the core developers had this same issue while they were building Blog API. Oh, look! They did! And here's how to fix it!"

sdboyer’s picture

Sorry to not get back on this right away, especially after marking it as an issue of the day. Good news is, y'all are bringing up most of the sorts of issues that have been confounding my thinking about this, so we're gonna get it all on the table :)

First, on defining "clean" - here's Linus definition, fwiw: http://www.mail-archive.com/dri-devel@lists.sourceforge.net/msg39091.html . It's generally a good definition to start thinking from, and touches on appropriate/inappropriate uses of rebase, too. The squash vs. interactive rebase thing has been pretty well-explained, so I'll leave that.

@kiamlaluno #8 - that's something we're going for, yes, but it's irrelevant here. The ACLs, per-branch or otherwise, have nothing to do with whose commits can end up where; all it governs is who can push commits into a particular branch. The concern here is defining a) the actual algorithms we'll use to calculate commit statistics, and in tandem with that b) the best practices for maintainers to use when interacting with the eventual per-issue repos.

@Crell #7 - not within a given issue branch, no. Because issue repos are going to be fast-forward only, and rebasing means rewriting history (aka, a non-fast-forward push), you won't be able to rewrite a branch that already exists in the issue repo. However, if you're RTBC/M and want to pretty it up, you can basically pop off a 'clone' of the development branch, interactively rebase the clone, then push that prettied-up branch into the issue repo.

There's legitimate cases to be staked out in support of recommending or not recommending squash, partial rebase, or straight merge workflows. And for the record, I think cweagans' point in #10 is an especially important example of why we want to preserve that history. In my thinking, though, that doesn't mandate full merges into the main project at all. These issue repos will stay open for all time, and so as long as the work is preserved there, it can still be gotten to. And if we create some drush extensions that, say, traverse a segment of a project's history and add remotes for all issue repos that were merged from, then ipso-facto, you can get all that history at your fingertips with no problem.

While we do need to settle on the recommended merge workflow, when it comes to statistics, the most important thing is that module maintainers are free to follow any of these lines of reasoning either as personal standard practice, or depending on the use case, or because they only grok one approach - and that the statistics will do their best to adapt. To that end, here's what I propose (and this supercedes anything in my original post). Note that I still feel like this is overcomplicated and I wish there was just something dead-simple, but every time I think through it I can't come up with something simpler that is flexible, meets all scenarios well, and still tells us something useful:

There'll be two layers to statistics tracking: as a baseline, every commit to a project repo is counted as a commit. Eventually, we'll also count commits from branches in issue repos that have been merged into mainline. The simplest case (wrt stats), a standard merge, will be handled without any additional work: since all the commits are pulled right into the project repo, they're counted directly. Squash merges, then, will be counted by hopping over to the project repo and counting all the commits that were included in the squash.

The only thing that'd be a little difficult is handling prettied-up branches. If you pretty up a branch, you lose commits. Period. After that, though, it's presumably going to be a standard merge into the project repo (there's no point in prettying a branch if it isn't), so the commits that aren't squashed into others will be counted directly. The commits that ARE squashed, though, are going to be hard to detect. Frankly, it'd be easiest if they just don't get counted, and then we make it standard practice to only squash truly superfluous commits. The alternative is coming up with some special way of marking the merge as being based off of a prettied branch, and then doing some other complicated stuff.

One thing I can say is - we are going to have the most mind-numbingly amazing code swarms EVER.

Crell’s picture

So just to clarify, I have only a vague, marginal, and probably incorrect understanding of what "squash merge" and "rebase" actually mean/do. Which means I probably have more clue what I'm talking about than 95% of contrib maintainers. And I understood maybe half of #11. :-) That's a situation we need to keep in mind.

So given that, trying to restate #11 in layman's terms:

- If a maintainer adds code to a project branch via a patch, it's his history, and it shows up a single commit from him. Same as it does now.

- If a maintainer adds code to a project branch via an issue branch, then he has the following options:

1) git merge issue1234/branchname project/7.x-1.x. All commits in that issue get pulled in as is, warts and all, and everyone with a commit gets credited for it.

2) git branch issue1234-local --track issue1234/branchname *Screw around with squashing and rebasing and other things that most people including me don't comprehend, which may kill some commit records along the way.* git merge issue1234-local project/7.x-1.x

And I have no idea what the "call out to the original thread and get stuff from there" stuff is all about, frankly. It seems over-engineered. Frankly if the maintainer thinks its wise to expunge a given commit from the official history, then there's probably a good reason for that so we shouldn't be recording it anyway.

There's perhaps a dozen people who would have legitimate use for #2 in practice; Dries, Earl, maybe quicksketch, etc.

I guess my point is that whatever we do has to be with the understanding that while git may have 500 commands and possible workflows if we expect anyone besides core maintainers and people like Earl or Jeff Miccollis that maintain huge, many-developer modules to use more than about 5 we're going to run into a wall of "G'Wha?" that will kill the whole effort.

sdboyer’s picture

Yep, that summary is on target. I need to draft a volunteer into creating an image that visually demonstrates how each of these things would work.

Nobody would need to be aware of the more complex workflows in order to do their thing. That's a big check.

"Call out to the original thread" goes something like this. On the d.o git server:

1. Sees there's a commit with a squash merge in it. Identifies that start and endpoint of the squash merge.
2. Looks through the commit message for an issue nid.
3. Grabs the repository corresponding to that issue nid, and read in all the commits to the stats log that fall within the start and endpoints.

Not very complicated at all.

Crell’s picture

OK, why would we want to do that all out? If the maintainer felt it appropriate to remove certain commits or squash them or whatever other black magic he does before merging, shouldn't we respect that? There was no doubt a reason for him to do so.

bojanz’s picture

@Crell
If I understand it correctly, it's done so that the issue committers get credited for their commits to the project even after the issue branch has been squashed and merged (which makes those commits non-existing in the main tree).

Crell’s picture

Right. I'm saying why aren't we trusting maintainers to make that decision. Just because someone threw crap into an issue branch somewhere doesn't mean they contributed anything useful, and if the code they provided was stripped out later on anyway why should they be credited for barfing into an issue?

My point is that if a maintainer takes option #2, we should assume that they know what they're doing and have the commit history that they feel is appropriate. That not only simplifies the code, it more closely models what we do now (where a maintainer is at his discretion for who to mention in a commit message based on his judgment). I expect the number of maintainers taking option 2 to be small anyway.

bojanz’s picture

That's a good point. We need a way to prevent people from trying to commit crap in order to get credit..

Crell’s picture

Status: Active » Needs review

So I guess we now call #11/#12 "needs review" as that looks like the approach we'll take...

sdboyer’s picture

Status: Needs review » Active

Squash merging has nothing to do with not thinking a set of commits are worthwhile; if you think certain commits shouldn't be included, that's what incremental rebase would be for. DBTNG is a great example: maybe a thousand commits along the way, but if I were in the driver's seat, the final thing would be squashed. If they're directly merged, someone traversing the history of the mainline will see minor, incremental commits right next to huge commits on an unrelated work series. This approach to counting squashed commits has the benefit of 1) actually reflecting the rough size and scale of individual areas of contributions while 2) still making it possible to look at the development mainline and follow the meta-progress of the project.

Crell’s picture

OK, I am confused now. Probably due to not actually grokking squashing, rebasing, and merging multiple in-development branches so I don't really understand what you're saying in #19. :-) Can you elaborate on what those all actually mean/do? (And/or point to links; either way we should archive whatever is said here for later people asking the same question.)

sdboyer’s picture

Raaagh. As I've been discussing and thinking about this with people, I've been feeling the need to sit back and brainstorm some more. So...I'll be back soon with more info/ideas.

sdboyer’s picture

Project: Drupal.org infrastructure » The Great Git Migration
Component: Git » Migration scripts
Priority: Normal » Critical

"soon" is such a delightfully relative term.

marvil07’s picture

After reading this and talking a little about it on a bof about analysis contribution today, I would say we are over-complicating things.

About workflow

Git gives you the power to do anything you want, that's it, you decide.

IMHO we can only set general things like non-fast-forward on the repository, but getting so inside the workflow sound like a really bad idea to me.

I think no so much people are really going to use that, and like discussed on other related issues, we must not increase the barrier to enter, I mean, anyone have the intention to track patches per issue and branches per issue at the same time?

I think rolling patches would work most of the time(it does not matter if you use N branches locally to track changes of different people), in the other side tracking big changes can be done by enabling sandbox per user per project. That sounds reasonable an simple to me.

About attribution

In the simple case, when there is only one contributor, you can get the attribution directly from the patch(git-format-patch and git-am).

In the more-than-one-contributor case, I think trusting maintainer(committer) is a good idea to give attribution through the commit log message and the we can just parse them (actually I have made some scripts to do that(see drupalcodeswarmlog.py)).

webchick’s picture

Title: Settle on an approach to recording commit statistics (and thereby, standardize the recommended merge workflow) » Decide on an approach to recording commit statistics (and thereby, standardize the recommended merge workflow)
Issue tags: +git sprint 1
chrisstrahl’s picture

Assigned: Unassigned » sdboyer
webchick’s picture

#898816: Consider using real names and/or e-mail addresses for the author/committer metadata? looks like it's related, talking about a reasonable way to set up names in .Git config so that we can track username data.

mparker17’s picture

Subscribe.

sdboyer’s picture

Status: Active » Fixed

OK, I've put a lot of thought and reflection into this, and I agree with marvil07 - I've been overcomplicating it with all the discussion of squash merges and complex workflows. We should be generating commit statistics based purely and exclusively on the commits that live in the project repository itself. So it'll be up to the maintainer to decide what credit will ultimately be given for, based on how they merge/what they push. No need to call out to separate repositories, no need for hard thinking about what the 'right' way to give credit is - just merge and go.

In addition to KISS, a big reason for changing my thinking is that my original concern about 'polluting' mainline branch history with topic branch commits was really overblown. All it takes is some extra parameters to hide them all from being output - for example, look at this complex merge history from the Panels/CTools sprint we just did:

Only local images are allowed.
output from git log --pretty=oneline --abbrev-commit --decorate --graph

That's a ton of different commits on different branches, all of which have been merged back into a mainline. That complexity can be hidden quite easily, though, just by adding --first-parent, which shows only commits derived from the first parent of a merge commit:

Only local images are allowed.
output from git log --pretty=oneline --abbrev-commit --decorate --graph --first-parent

You can even go a step further and wipe out the merge commits, seeing only direct commits on the mainline:

Only local images are allowed.
output from git log --pretty=oneline --abbrev-commit --decorate --graph --first-parent --no-merges

In short, we can have our cake and eat it too - history can be made clean for human consumption, but we can still retain all the relevant committer information in a single repository. Done and done. Commit statistics, then, are straightforward - if it's in the project repo, you get credit. Otherwise, nada. We might consider noting commits on mainline branches in some slightly way, but that can all be tweaked later.

So, what does this mean for recommended merge workflow? Pretty much standard rebase/merge (which I won't detail here, but is doc'd elsewhere), with the only slight exception being that maintainers should get in the habit of using the --no-ff option to git merge when merging in changes made by other people. That ensures no fast-forward merges - see http://nvie.com/posts/a-successful-git-branching-model/ for more on why that's important.

I'm gonna mark this fixed. Please reopen if someone sees a big problem with this.

Status: Fixed » Closed (fixed)
Issue tags: -git phase 2, -git sprint 1

Automatically closed -- issue fixed for 2 weeks with no activity.