apachesolr_site_hash() calls md5() twice which is not only inefficient, it reduces the possible combinations available. The attached patch only calls md5() once.

As apachesolr_site_hash() checks if the value is set in the variable table, this change won't have any impact on users who have "site_hash"es already defined.

Comments

skwashd’s picture

StatusFileSize
new559 bytes

Trying to attach the patch again. It was attached before I submitted :\

robertdouglass’s picture

One of the unfortunate problems of this patch is it will cause the need to delete the index and re-index everything.

Scott Reynolds’s picture

This strikes me as more then a typo. It has the feel of being done on purpose. Any idea?

pwolanin’s picture

Status: Needs review » Closed (works as designed)

Of course it's done on purpose - to reduce any possible information leakage about the private key.

Also - there is no reason this would reduce the combinations in any meaningful way, and it's only done once so performance is a non-issue

skwashd’s picture

Status: Closed (works as designed) » Active

@robertDouglass it won't need to be reindexed as the key is only generated once and then stored in the variable table.

@pwolanin the key generation code isn't a black box it is available as PHP source code in a public repository and distributed unencoded. There is no (psuedo) random value used so the key is reproducible. There is, at most, only 16^32 possible combinations outputted from the first call of md5(), so you are hashing a subset, then taking a sub set of that string as the final value, leaving you with less than 12^32 possible keys. The second md5() provides no benefit in generating unique or more secret keys. If the aim is "to reduce any possible information leakage about the private key", then introducing some randomness to the seed, such as mt_rand() or time() would better achieve this goal. Reopening as the current code doesn't fulfil its stated purpose.

Scott Reynolds’s picture

@robertDouglass it won't need to be reindexed as the key is only generated once and then stored in the variable table.

and then stored in the solr schema for every node...

damien tournoud’s picture

Status: Active » Reviewed & tested by the community

Let's do that. As stated, the key is only generated once, changing the way the key is generated will not impact any existing installation.

damien tournoud’s picture

Status: Reviewed & tested by the community » Needs work

This said, I agree that this should just generate a random value. There is zero point in using the private key here.

The standard way to do that is:

md5(uniqid('', TRUE))
skwashd’s picture

@Scott Reynolds yes, but the key is only generated once then stored, so changing the key generation algorithm has no impact on existing users as their current key will remain unchanged and so not necessitate reindexing.

@Damien Tournoud I'm happy to do it either way, so long as it doesn't involve calling md5(md5()).

I'll wait for followup comments before rerolling.

pwolanin’s picture

@DamZ - no point in the md5 even?

damien tournoud’s picture

@pwolanin: no real point in md5() here, except to normalize the value (in the sense: stretch it to a normalized size).

skwashd’s picture

uiniqid() only appends the the microseconds on *nix, under windows you only get the hexadecimal value, so you have 1 second windows for race conditions under windows. The risk of collisions under *nix are reduced, but to remove the risk (almost) completely on either platform, you could use the following:

$hash = substr(md5(uniqid($base_url), TRUE), 0, 12);

You end up with a unique key that has a random and a known component and the format remains unchanged from current implementation.

damien tournoud’s picture

@skwashd: no, the hex part is the part that contains the time. The rest is controlled by the $more_entropy parameter, and the risk of collision is unbelievably small already.

skwashd’s picture

@Damien Tournoud

@skwashd: no, the hex part is the part that contains the time. The rest is controlled by the $more_entropy parameter, and the risk of collision is unbelievably small already.

OK, I was wrong. Did some RTFMing and leant something today.

pwolanin’s picture

Uniqueness is really not that big a worry. This value is only useful if you are running a multisite search or doing some reindexing tricks, and typically even those people will have just a few sites.

Given that there are other problems with long URLs, I suggest we try to reduce the # of characters further.

Something like:

substr(base_convert(sha1(uniqid($base_url, TRUE)), 16, 36), 0, 6);

Which gives us a default hash space of about 2 * 10^7 and is the same as the 12 character hex value.

Or we could go for 5 characters - still about 60 * 10^6 values in the hash space. Seems more than enough.

pwolanin’s picture

Version: 6.x-1.0 » 6.x-1.x-dev
Status: Needs work » Needs review
StatusFileSize
new853 bytes
skwashd’s picture

Version: 6.x-1.x-dev » 6.x-1.0
Status: Needs review » Needs work

Uniqueness is really not that big a worry. This value is only useful if you are running a multisite search or doing some reindexing tricks, and typically even those people will have just a few sites.

I have ~2100 production drupal sites using an pair of apache solr servers. Although I don't plan provision additional sites any time soon, I'd like to avoid the potential for race conditions when having aegir deploying 20 sites at a time.

skwashd’s picture

Version: 6.x-1.0 » 5.x-2.x-dev
Status: Needs work » Needs review

reverting field changes

pwolanin’s picture

Version: 5.x-2.x-dev » 6.x-1.x-dev
StatusFileSize
new853 bytes

6.x-1.x is the main branch for fixes. We can leave this at 6 digits, which should be the same size space as currently.

uniqid() with extra entropy shoudl be pretty immune to race conditions - the only issue is collisions in the hash space. If you are worried about actual race conditions, it would be better even to leave out the $base_url.

pwolanin’s picture

Version: 6.x-1.x-dev » 6.x-2.x-dev
Status: Needs review » Patch (to be ported)
StatusFileSize
new853 bytes

Fixed code comment - committed this patch to 6.x-1.x

robertdouglass’s picture

Version: 6.x-2.x-dev » 5.x-2.x-dev

Committed to 6.x-2.x.

jpmckinney’s picture

Status: Patch (to be ported) » Fixed

Status: Fixed » Closed (fixed)
Issue tags: -Performance, -code cleanup

Automatically closed -- issue fixed for 2 weeks with no activity.