Background

Problem

  • Serial IDs are only useful within the scope of a single site, but lose their entire meaning in staging scenarios.
  • Menu links as well as reference fields (e.g., taxonomy term, file, image, ...) store the serial ID as value.
  • Links within content contain serial IDs and thus may point to a wrong target after being staged.
  • Serial IDs present collision problems with migration: #2748609: [meta] Preserving auto-increment IDs on migration is fragile

Goal

  • Make entity references compatible with content staging.

Proposed solution

  1. Replace serial IDs with UUIDs in reference field values.
  2. Replace serial IDs with UUIDs in URLs.

Details

  • Storage lookups for integer/serial IDs vs. string UUIDs should be identical performance-wise.
  • URL paths, such as /node/1637370, are not really any nicer than /node/4bd1c937-f530-456f-a218-241d7bfd40b8.
  • For backwards-compatibility, there could be URL aliases for the original serial IDs.

Related issues

Notes

  • Upgrade path will become quite complex, especially for in-text/content links.
  • http://drupal.org/project/uuid_redirect is a contrib attempt to use UUIDs in URLs.
  • Consider to remove the $entity_type argument from entity_load_by_uuid(), and introduce a global mapping table for UUID » entity_type mappings (required since entities live in different storages), so entities can be loaded through the UUID only.
Files: 

Comments

plach’s picture

I have some concerns about the current proposal:

  1. Given the available timeframe, moving all the system from serial ids to UUIDs looks like something that could easily result in a half-baked solution.
  2. I'd be very surprised if it turned out that this move had not any overall performance implication.
  3. I don't agree that from a [D|U]X perspective /node/1637370 is "not really any nicer than /node/4bd1c937-f530-456f-a218-241d7bfd40b8": I can memorize the former for at least a small amout of time if I need so, the latter is impossibile to.

I think we may want to introduce some kind of entity permalinks in the form of entity/4bd1c937-f530-456f-a218-241d7bfd40b8, but I don't think we should replace the URLs using the serialized id.

Instead I think we should introduce a mapping during the migration between the source and destination environments: if I understood the WSCCI Web Services Format Sprint Report correctly, we should be able to identify all the references, which will obviously carry the UUIDs of the referenced entities. A table, like the one mentioned in the OP, mapping UUIDs to entities without the need to specify the entity type, would let us translate the serial ids on the fly. Texts could be processed to find known (pluggable) patterns to replace the ids the same way. This looks like a failry more encapsulated and less invasive solution to my eyes.

Note that I willingly used a terminology picked from Migrate, because it would be very cool if we could rely on it besides the upgrade path.

andypost’s picture

I see no way of using One tremendous huge table of uuid-mappings. OHOT we could implement a kind of "codifying" a plugin name or plugin uuid into entity uuids so getting uuid core could choose a plugin to create an instance and serve the request

Also this system would be half-baked without auto-increment field (serial field) because a lot of systems already use serial node IDs for uniqueness.

webchick’s picture

TL;DR "What plach said."

Making public-facing URLs convoluted and screwed up for 99% of sites that edit content directly on production (e.g. Drupal.org, everyone's personal blogs, etc.) for the 1% that has a separate staging environment or a centralized content store that gets pushed to sub-sites gets an enormous, flaming, marquee tagged -1 from me. This means that every single Drupal user now has to deal with unmemorizable URLs, which are effectively the "user interface" to the website, and end up being typed into the browser and being shared in emails/tweets/etc.. And "Just Google it" doesn't work with intranet sites, nor for brand new content Google hasn't figured out yet.

Also, I disagree with the "Details" in the issue summary:

"Storage lookups for integer/serial IDs vs. string UUIDs should be identical performance-wise.
...
For backwards-compatibility, there could be URL aliases for the original serial IDs."

These two are completely at odds with each other.

"URL paths, such as /node/1637370, are not really any nicer than /node/4bd1c937-f530-456f-a218-241d7bfd40b8."

There is a huge difference, in that the former is memorizable (especially since most sites don't have 1m+ node ids), fits into a tweet, and neatly communicates sequence and as to whether it's newer or older than node/16371. 4bd1c937-f530-456f-a218-241d7bfd40b8 is none of those things.

I think we certainly need to support people who need complex workflows (and plach's suggestions sound interesting), but we are not screwing up our default product for 99% of Drupal websites out there. And let's not even mention the upgrade path. :\

Crell’s picture

-1, for pretty much all the reasons plach said.

Joining on a UUID string is slower than on an int.

ints are easier to debug.

/node/12345 is way easier to deal with than node/4bd1c937-f530-456f-a218-241d7bfd40b8.

I run a site with a half million nodes. No, I am not using pathauto. The idea that everyone uses pathauto is nonsensical. Drupal.org doesn't use pathauto.

Deployment is still an edge case. The vast majority of sites don't have complex deploy or content re-syndication going on. We shouldn't break everything in Drupal for that edge case. We should just enable the edge case to not be as gross as it is today.

That means we should be tracking UUIDs along with serial IDs, sure, but not making them the primary ID.

kattekrab’s picture

Please. No.

I find /node/nid totally grokkable and often use the nid as a shortcut. And I tell my clients and users they can do that too. That would be impossible with the uuid.

So with all respect I humbly disagree. There is a massive difference.

sun’s picture

Please note that I basically just "forwarded" this idea/suggestion into an own/proper issue. :)

For content entities, there are indeed alternative ways to approach this; e.g., by adding both the serial ID and UUID to reference tables. That essentially leads to a mapping though.

What can't be really and reliably solved though is the case of references (e.g., links) within (text) content. E.g., if you prepare your shiny fancy new product section on the development/staging site and your descriptions happen to contain links to all the related pages, then those text links might point to entirely different content entities after staging the product section to the production site. — It is possible to work around that in a few ways; e.g., by using URL aliases and only linking to those, or by using inline macro tags instead of plain links in the text, but in the end, those are poor (and partially complex) workarounds.

I've seen quite a couple of modern systems recently that are using UUIDs exclusively. Once you've seen such a system in action, you realize how much pain this pretty simple and straightforward move cuts away. It's one of those situations where you sit in front of that and ask yourself why you're wasting any more time with getting serial IDs to work in a reliable way... ;)

I don't think there's a performance aspect. UUIDs as primary keys can be directly compared with the primary keys of cache tables.

But anyway - while I think it is a good idea, there are workarounds for content entities. I'd still like to see some more discussion on this topic though; especially as the people who primarily advocated for it did not follow up yet.

heyrocker’s picture

I am an advocate for a URL system that is not tied to serial IDs, because it doesn't just make content staging hard, it makes it impossible. When people create a piece of content that links to other pieces of content on their site, they are not going to remember a special entity permalink. They are going to copy and paste the URL as it appears in the browser. When this content is deployed, those links will break. This was one of the most common complaints against Deploy, if you're not using Pathauto, or if your Pathauto settings include the nid, everything falls apart with embedded links. There are several answers to this.

1) UUIDs in URLs. While I agree with all the objections to this, I also feel that the tradeoff here needs serious consideration.
2) Put Pathauto in core with a reasonable set of defaults for all the core content types. I actually think this is the best solution, but it will also take a lot more work and there's a question of what to do about hosting providers that don't support mod_rewrite out of the box (if this is still a problem.)
3) Write a code that takes internal URLs to serial IDs, maps them too UUID-based URLs on content export, then maps them back to serial IDs on import in the new system. Ugly.

One way or another though, something has to be done about this problem for multisite content staging to work *at all*.

As far as replacing serial IDs system-wide, I think this bears investigation, but it is wayyyyyy too late in the cycle to consider it. This should have been researched a year ago but now? No way. Defer to D9.

plach’s picture

We might introduce optional UUID-based permalinks. They might work like clean urls did: by enabling UUID-based urls url() would generate entity/4bd1c937-f530-456f-a218-241d7bfd40b8 instead of node/1. This way the relatively few sites needing this feature would have it without impacting on the majority of other sites.

heyrocker’s picture

That would only help for URLs generated with url(), it wouldn't help the use case we're discussing, which is people copying/pasting a URL into a textfield.

I also disagree with the 'relatively few' designation. Pretty much every client I've worked for professionally has wanted this functionality in one form or another. While this has admittedly been mostly larger organizations, there are still quite a lot of those in the world.

webchick’s picture

I think there are mainly 3 types of content staging:

1. None: Just edit content directly on the production website, and toggle the publish flag on things that aren't ready yet (all small websites, personal blogs, Drupal.org, etc.)

2. Access-based solution: Edit the content directly on the production website, but let something else (Workflow Access, OG, etc.) handle when content is public or not. (pretty much every client I've ever worked for)

3. Staging server solution: Edit the content on a completely different website, then deliberately push content or collections of content to the production server (common workflow in enterprise CMSes, also apparently pretty much every client Greg has worked for)

UUIDs in URLs solves the problem for group #3, but definitely at expense of group #1 and #2. I'm not saying group #3 is "1%" but they definitely are vastly outnumbered by 1 and 2.

As one minor data point, the Large Scale Drupal program consists of a handful of organizations one would definitely call large, "enterprise" customers. They were pitched proposals for solving the content staging problem along the lines of either #2 or #3 as possible workflows. They chose to back the solution that layered on top of #2.

plach’s picture

@heyrocker:

Sorry Greg, but I don't get your objection: if you have a site that emits only UUID-based URLs, the ones that get copied in the text fields will be still UUID-based, no?

heyrocker’s picture

I might be confused actually, because when you said URLs output by url() I hadn't realized that meant we would emit UUID-based URLs in the browser address bar. If that is the case, then that solution could work.

As one minor data point, the Large Scale Drupal program consists of a handful of organizations one would definitely call large, "enterprise" customers. They were pitched proposals for solving the content staging problem along the lines of either #2 or #3 as possible workflows. They chose to back the solution that layered on top of #2.

It should be pointed out that a large part of the reason #2 was chosen is because #3 is literally impossible to implement in D7. They really didn't have any choice.

Crell’s picture

I should also note that, while I want #3 to be possible, a good implementation of #2 can render #3 far less necessary. Most of the clients I've had of late that have said they wanted #3 actually, in practice, are perfectly happy with Workbench Moderation (a version of #2), or at least happy enough.

(Ironic, given that there's a not-small overlap between my clients and heyrocker's. :-) )

sun’s picture

Actually, I wonder...

Is it possible that the new entity system in D8 might even be so cool that people could just simply opt-in into "UUIDs as primary IDs"?

We've replaced all direct $entity->id property accesses already, so:

function uuid_entity_info_alter(&$types) {
  foreach ($types as $type) {
    if (!empty($type['entity keys']['uuid'])) {
      $type['entity keys']['id'] = $type['entity keys']['uuid'];
    }
  }
}
heyrocker’s picture

I have had at least one very large multinational retail client tell us that they actually can't have an time-sensitive data *exist* on the production server in any way shape or form until it is meant to be public without running afoul of their legal department. That said, I think that what Crell is saying is true (although of course its only semi-recently that even #2 has become possible in any real sense of the word.) Ultimately, the one thing they all want is the ability to preview page layouts with true content without making those layouts live. Doing it on a staging server and deploying is one answer. Doing an access/layout based solution is another. We do want to support all the use cases.

Crell’s picture

sun: No, because that would impact DB schemas, many of which currently expect an int. (That's debatably a bug already, but it's there.)

skwashd’s picture

This is not an easy problem to solve.

Some time ago I worked on multiple ports of uuid.module from D6 to D7. During that time I was involved in many discussions on IRC about the merits of different approaches - I think I ended up doing 3 (or was it 4) ports, then dixon did the final entity based approach. Dave Reid went down the one central table route in his own port.

For most of the last year I have been working with a large client on implementing a deployment and content management solution that is completely dependent on UUIDs. We had to go with #3 over #2 due to the clients requirements which are somewhat similar to heyrocker's client. The 2 things they insisted on having were a full audit trail (financial services client) and they needed to be able to control the contents and placement of blocks (BEANs + context). LSD wasn't around back then.

For this project we decided to use serial ids in our URLs to keep things simple, but for embedding links to other entities we created the UUID Link module which allows the links to continue to work as the entities are deployed through various environments. It works best when used with WYSIWYG (sorry it only supports CKEditor atm).

I am concerned about the performance impact of one big UUID table or joining on char(36) columns. At the same time UUID Link isn't the most lightweight solution. The approach taken by deploy for D7 is to only do the mapping when the deployment happens. To me this seems like a sane trade off. Unfortunately deploy currently relies on uuid_services which breaks the normal services CRUD interface. I haven't kept up with where WSCCI is at with entities and UUIDs.

In terms of UX with node/[int] vs node/[uuid], for most people they are both bad - which is why we have pathauto. I suspect the performance trade off from allowing both in the menu callbacks wouldn't be too great and a config option could be used to default to one or the other. The default should be ints, so new users with simple sites don't end up with the uglier UUID based URLs before they install pathauto.

Crell’s picture

Configurable per-site would greatly complicate the router, and probably confuse Google if somehow both were enabled (and you know someone is going to want to do that). And when Google gets confused, people lose their SEO placement *pinky maneuver*.

sun’s picture

I really wanted to see whether #14 is possible. Attached diff shows what I found.

Short version: It works. But assumes/requires consistent and proper use of the Entity API in every spot of the code-base.

The entity system itself and the field system apparently are fully prepared for this already, since they do not expect/require entity IDs to be serial IDs, by design.

Proof:

entity-info-uuid.png

:)

Berdir’s picture

The entity system itself and the field system apparently are fully prepared for this already, since they do not expect/require entity IDs to be serial IDs, by design.

The entity system might be, but the field system is certainly not.

function _field_sql_storage_schema($field) {
  $deleted = $field['deleted'] ? 'deleted ' : '';
  $current = array(
    'description' => "Data storage for {$deleted}field {$field['id']} ({$field['field_name']})",
    'fields' => array(
      'entity_type' => array(
        'type' => 'varchar',
        'length' => 128,
        'not null' => TRUE,
        'default' => '',
        'description' => 'The entity type this data is attached to',
      ),
      // ....
      'entity_id' => array(
        'type' => 'int',
        'unsigned' => TRUE,
        'not null' => TRUE,
        'description' => 'The entity id this data is attached to',
      ),
jbrown’s picture

Note that with innodb, tables shouldn't have a long primary key: http://dev.mysql.com/doc/refman/5.6/en/optimizing-innodb-storage-layout....

mdupont’s picture

Another scenario I encounter where UUIDs are useful: consider configuration import/export, where nodes and menu links can sometimes be considered as configuration. In this case, deploying the configuration on other environments should include creating the nodes / links. Think of UUID Features module.

For instance, if I'm developing a new section for my website the configuration can include new content types, module dependencies, default configuration, views, etc, but also a new landing page (node) and its associated top menu link. In this case the node is more configuration than content and it would be good to be able to automate its deployment along with other configuration.

Even if the UUID is not used in the URL or in reference fields when stored in the DB, they could replace the serial ID as the identifier when exported as code, and the import routine would recreate a node with a different serial ID and set reference fields data to corresponding serial IDs. There would then be increased load only during export / import operations, everything else could still use serial IDs.

I do not take into account links in text content in this scenario since in this case everything is under control of developers.

Robin Millette’s picture

Since UUIDs are unique, I could imagine the system URLs to be like http://example.com/4bd1c937-f530-456f-a218-241d7bfd40b8 and a core pathauto-like mecanism would make that available at http://example.com/node/66 if it's a node or at http://example.com/user/33 for a user, etc. but only create these latter paths is the "nostalgia" option is set. In most cases, the UUID paths will also be available at another token-based or manually set paths and we will soon forget the days of node/66.

Crell’s picture

Just to remind people, multiple paths that point to the same content is a great way to get Google to hate you. Building a "Google, please hate me" path structure into Drupal is a non-starter on its face.

plach’s picture

@Crell:

We have canonical URLs for that, we can easily have:

  • http://example.org/entity/4bd1c937-f530-456f-a218-241d7bfd40b8
  • http://example.org/node/1
  • http://example.org/about-us

pointing at the same content and define which form is the canonical one, we might even make that configurable.

However, it seems we have two parties here: those who totally would like the have UUID-based URLs and those who absolutely don't. Do you think the new router system could allow for pluggable entity URI patterns?

Crell’s picture

Not on its own, no. It does separate out the route name from the path, which would help, but you still have to pass in a placeholder value. So $generator->generate('node_page', array('node' => 5)); vs. $generator->generate('node_page', array('node' => 'abc123')); That wouldn't be any more or less easy to map back and forth than doing that sort of futzing is now, but remember any futzing we do there is along critical path so we have to be careful with it.

fago’s picture

Just to remind people, multiple paths that point to the same content is a great way to get Google to hate you.

I don't think that's a problem as you long as you specify the canonical URL meta tag for HTML files. For non HTML responses it's also fine if you respond to multiple URLs, it's just important that you refer to the same entity to the same URI even if you respond from a different one (it's about identifing right). That said, I could see us going with regular node/nid URLs + node/UUID + url-aliases in paralell.

heyrocker’s picture

Status: Active » Closed (won't fix)

Well, it appears from this discussion that there is a consensus that UUIDs in end user URLs is unacceptable. Also I don't see any movement on Pathauto for core to resolve that in any meaningful way. So it appears to me this is a won't fix.

moshe weitzman’s picture

Status: Closed (won't fix) » Postponed

If we ever do get pathauto in core, I think we should unpostpone this and discuss further.

sun’s picture

@floretan also ran into the limitation of Field API assuming/requiring serial/integer entity IDs:
http://wunderkraut.com/blog/remote-entities-in-drupal-7/2012-10-25

Thus, I've created:
#1823494: Field API assumes serial/integer entity IDs, but the entity system does not

skwashd’s picture

The D7 UUID module now has support for UUID URI resolution. It just redirects to the normal URI for the entity, but it works. Maybe when this gets picked up again it will provide some inspiration for someone. http://drupalcode.org/project/uuid.git/commitdiff/dce90982a50c5444cc8756...

Jānis Bebrītis’s picture

Status: Postponed » Active

Hang on, UUID implementation never got trough? I had impression that this issue will finally be solved with drupal8.

This is a major blocker for content staging, there is no way to implement correct workflow without major hacking. If we have UUID's on the other hand, it should be quite simple to create node/123 functionality with custom code.

It's not just a matter of "I'm used to this, it's easier", It's for drupal's growth and we all need this. We have only few days to push this trough and make drupal 8 a more mature system, think about it.

please, correct status of this issue if i'm wrong.

heyrocker’s picture

Status: Active » Closed (won't fix)

UUIDs are being added to all entities, they are just not being used for those entities' URLs.

In #30 this was postponed to wait and see if Pathauto got into core, which is not going to happen, so going back to won't fix.

klonos’s picture

Version: 8.x-dev » 9.x-dev
Status: Closed (won't fix) » Active

...just to be in tandem with #229568: Pathauto in Core ;)

klonos’s picture

Issue summary: View changes

Updated issue summary.

moonray’s picture

At least for the performance part of using UUID as primary keys, would something like this increase performance enough to make it viable?

http://kekoav.com/posts/uuid-primary-key-mysql

A UUID [...] is 36 characters long, including dashes. If you store this VARCHAR(36), you're going to decrease compare performance dramatically. This is your primary key, you don't want it to be slow.

At its bit level, a UUID is 128 bits, which means it will fit into 16 bytes, note this is not very human readable, but it will keep storage low, and is only 4 times larger than a 32-bit int, or 2 times larger than a 64-bit int. I will use a VARBINARY(16) Theoretically, this can work without a lot of overhead.

EDIT: Another article features some performance testing data http://blog.simplicitymedialtd.co.uk/437/storing-billions-of-uuid-fields...

moonray’s picture

Issue summary: View changes

...adding #229568 to the list of related issues and shorting the list chronologically ;)

mikeryan’s picture

+1 for eventually using UUIDs as the primary unique identifiers for content. Serial IDs, which do not globally identify an entity, are problematic for migration: #2748609: [meta] Preserving auto-increment IDs on migration is fragile

larowlan’s picture

mikeryan’s picture

That issue would appear to be a dupe of this one, actually:P. Although, the ambitions of this issue (along with my own) are a bit broader...

larowlan’s picture

the other one was rtbc last week

dawehner’s picture

This issue seems to be more about the discussion of requiring UUIDs by default, the other one enables them. It would be indeed quite stupid, to mark the other one as duplicate, because this one has no traction in the recent time.

colan’s picture

Status: Active » Postponed

If this issue is about making that one mandatory, then we need to wait until it's completed.

catch’s picture

Project: Drupal core » Drupal core ideas
Version: 9.x-dev »
Component: entity system » Idea

This is a major change with both framework and UX implications, so moving it to the ideas queue.