Ok, this post is gonna be long and include various graphics. Might want to grab a cup of coffee. Also, disclaimer: This is just documentation of what I've learned from spending way too many hours staring at this stuff, and reflects my current understanding which may not be 100% correct. But it's a place to start talking, anyway. :)
This weekend, Dave Reid and I had a chance to "sit down" on IRC and go over the architecture of the XML Sitemap module, some of its flaws, and how to generally make the module scale on high-performance sites. Dave has been toiling on some code in his sandbox that basically represents a ground-up re-write of the xmlsitemap module with various architecture improvements. I want to share some of those here with you so that we can brainstorm about how (and I guess if) to bring those into the XML Sitemap module.
Let's start at the data model. Here's the data model for Sitemap module:
And here's the one for XML Sitemap module:
Both are pretty similar in both their form and function. Each module that adds links to the sitemap has its own table where it tracks its data, and there's a central "sitemap" table that holds an aggregated view of the entire thing. When the cache files are generated, they pull data out of the central table rather than having to calculate it all the time.
Sitemap has dropped some columns, such as the lid field (opting to make "type" and "id" a composite key instead) and the perplexing "sid" field (what is that for, anyway?). Sitemap also archives *all* of the link properties in both places so that it's easy to tell what's changed.
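To make the shared shape a bit more concrete, here's a rough sketch in Python with SQLite. It is not either module's actual schema: the table names (`sitemap_node`, `sitemap`) and every column other than "type" and "id" are made up for illustration. The point is just "one table per contributing module, plus one central table keyed on (type, id) that the cache files are built from."

```python
import sqlite3

def build_schema(conn):
    # Per-module table: a hypothetical stand-in for one module tracking its own links.
    conn.execute("""
        CREATE TABLE sitemap_node (
            nid      INTEGER PRIMARY KEY,
            loc      TEXT    NOT NULL,
            changed  INTEGER NOT NULL,
            priority REAL    NOT NULL
        )""")
    # Central table: the aggregated view, keyed on the composite (type, id).
    conn.execute("""
        CREATE TABLE sitemap (
            type     TEXT    NOT NULL,
            id       INTEGER NOT NULL,
            loc      TEXT    NOT NULL,
            changed  INTEGER NOT NULL,
            priority REAL    NOT NULL,
            PRIMARY KEY (type, id)
        )""")

def sync_node_links(conn):
    # Copy (or refresh) the node module's rows into the central table.
    # The composite key means re-running this never creates duplicates.
    conn.execute("""
        INSERT OR REPLACE INTO sitemap (type, id, loc, changed, priority)
        SELECT 'node', nid, loc, changed, priority FROM sitemap_node""")

def links_for_cache(conn):
    # Cache file generation reads only the central table.
    return conn.execute(
        "SELECT loc, changed, priority FROM sitemap ORDER BY type, id"
    ).fetchall()
```

Because the key is (type, id), a second module could sync its own rows under a different "type" without colliding with node ids.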
Where the modules really start to differ is when we look at their "populate sitemap vs. serve sitemap" logic. Observe a flowchart for sitemap module (note: I haven't drawn flowcharts like this for like 5 years so forgive me if they're off ;)):
All of the expensive operations -- generating the sitemap cache files, populating legacy content into the sitemap tables, etc. -- are done on cron. Additionally, it does some neat tricks, like switching to an anonymous user *while* gathering the list of links and populating legacy content with the newest stuff first (so that if you have 100K nodes being processed 500 at a time, the stuff you posted yesterday will end up in the sitemap first, and the stuff you posted 10 years ago can wait a few days). But best of all, sitemap_output() is *really* simple: it merely reads the file out of cache and sends it to the browser. Done and done. This is important because the primary audience for sitemap.xml is search engines, and you want them to get results quickly.
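The "newest stuff first, a batch per cron run" idea can be sketched like this. It's plain Python rather than Drupal code, and the names (`populate_batch`, `cron_run`, the `nid`/`created` dict keys) are my own assumptions, not functions from either module.

```python
def populate_batch(all_nodes, already_indexed, batch_size=500):
    """Pick the next batch of node ids to add to the sitemap, newest first."""
    pending = [n for n in all_nodes if n["nid"] not in already_indexed]
    # Sort by creation time, descending, so recent content is indexed first.
    pending.sort(key=lambda n: n["created"], reverse=True)
    return [n["nid"] for n in pending[:batch_size]]

def cron_run(all_nodes, indexed, batch_size=500):
    """One cron pass: index a single batch and remember what was done."""
    batch = populate_batch(all_nodes, indexed, batch_size)
    indexed.update(batch)
    return batch
```

With five nodes and a batch size of 2, three cron runs would pick up the two newest nodes, then the next two, then the oldest one. Old content still gets there eventually; it just waits its turn.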
Contrast that with a flowchart of the existing XML Sitemap module, which has some issues in this regard:
The first major problem that I see is that xmlsitemap_output() does *not* only output the sitemap. It seems to be a "jack of all trades" callback, which both generates *and* displays the contents of the XML file. While I can see how this could work fine on a site with 50 nodes, this is not a good thing on a site with several thousand, let alone 100K: it means Google's going to sit around waiting for a bunch of XML files to get generated before it gets any sitemap contents. Instead, it should be able to grab the 2MB+ file as quickly as possible.
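Here's roughly what I mean by splitting "generate" from "serve", as a Python sketch. Neither function is real module code; `generate_sitemap_file` and `serve_sitemap` are hypothetical names, and the XML is trimmed down to just the loc element.

```python
import os
from xml.sax.saxutils import escape

def generate_sitemap_file(links, path):
    """Expensive step, meant to run on cron: build the XML and write it out."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for loc in links:
        lines.append("  <url><loc>%s</loc></url>" % escape(loc))
    lines.append("</urlset>")
    # Write to a temporary file, then swap it in, so a crawler never
    # reads a half-written sitemap.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    os.replace(tmp_path, path)

def serve_sitemap(path):
    """Cheap step, run per request: no queries, no generation, just a read."""
    with open(path, "rb") as f:
        return f.read()
```

The request-time function never touches the database, so its cost depends on the file size, not on how many nodes the site has.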
I also notice some issues with the way legacy node content is populated into the tables. Rather than using something like db_query_range() to select only X nodes at a time, all published nodes are selected into a single result set. That result set is then looped through, one node at a time, and added to the sitemap. The problem is, that list of nodes is not run through anything like db_rewrite_sql() to see if the anonymous user has access to them, which I think is why XML Sitemap does that awkward thing of only allowing user 0 to see sitemap.xml.
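For illustration, here's the "select a range, and only what anonymous can see" idea as a Python/SQLite sketch. This is not how db_query_range() or db_rewrite_sql() are implemented; the `anon_access` table is a made-up stand-in for whatever access check the site uses, and `next_nodes_for_sitemap` is a hypothetical name.

```python
import sqlite3

def next_nodes_for_sitemap(conn, offset, limit):
    """Fetch one page of published nodes that anonymous users may view."""
    rows = conn.execute(
        """
        SELECT n.nid
        FROM node n
        WHERE n.status = 1
          -- Access filter: hypothetical table listing anonymously viewable nodes.
          AND EXISTS (SELECT 1 FROM anon_access a WHERE a.nid = n.nid)
        ORDER BY n.created DESC
        LIMIT ? OFFSET ?
        """,
        (limit, offset),
    ).fetchall()
    return [row[0] for row in rows]
```

Two things change compared to selecting everything: the database only hands back `limit` rows per call, and nodes that anonymous users can't view never make it into the sitemap in the first place.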
And then just in general, the code in sitemap module is a bit more compartmentalized and easy to understand, at least for me. A lot of the patches I've filed so far have just been a result of me struggling through xmlsitemap's code and going "Wait. What? Why does it work that way? Shouldn't it be this?" Kiam has done an amazing job of answering all of my questions (and committing lots of improvements!), but I still can't help thinking that the XML Sitemap module is a bit more complex than it needs to be, and could be better broken up into smaller functions that each do one thing and do it well.
Now, as far as where to go from here.... well... let's talk about it. :)