indexer issue [#320044]

I am not entirely sure if this is a sphinxsearch issue or indexer issue, but it appears to lie somewhere between the two...

I have been able to build about 10 different indices for a particular site (not live yet).

Data has now been modified across board, so I chose to rebuild the all indices again. Each one is built individually, pulling about 100k or 150k nodes back using a sphinx.conf file similar to the sample supplied in this module.

Now a index that was built in the past, is now failing with new data giving errors like:

- XML parse error: no element found {line=1578705, pos=0 docid=198691} ...

- "unclosed CDATA" ....

At the time it happens, I know the machine is close to php memory limit... So that is what I am not sure if its a memory issue.

I do not want to break up node ranges since over 1.5 million, and currently have 15 indices in conf file...

Please do you have any ideas? Thanks

Comments

Comment #1

markus_petrux commented 11 October 2008 at 20:21

A few ideas to try to sched some light:

1) Try invoking the XMLPipe command from the shell redirecting the output to a file. Then open the file to see how that line looks like. Maybe there's a PHP error, the file is interrupted, or something...

2) docid=198691 is the node ID. You could create a temporary main index with XMLPipe command where first/last node ID is that one. If that works, then it looks like the problem is somewhere else.

3) Try looking at admin/logs/watchdog filtering sphinxsearch reports. Maybe there's something there that it was missed before.

Comment #2

zeezhao commented 11 October 2008 at 22:23

Thanks for your suggestions.

I ran a temporary index on a node that was failing and it still failed. By the way, it seems to fail on the (node + 1)

So in another example, it reported failure on docid=208218, but only failed when node=208219.

So then ran the wget manually and redirected it to a file. Hard to tell what the issue is since I'm not familiar with it, except some funny characters... Maybe something to do with charset_table?

Using: charset_type= utf-8, and charset_table exactly what you had in sample file.

watchdog had nothing extra.

Thanks for your help.

Comment #3

zeezhao commented 11 October 2008 at 22:27

Ok, looks like that node had a funny character ""...

willl have to figure out how to handle this as I have no control over the data... and 1.5m nodes.... Any ideas

Thanks for tips.

Comment #4

markus_petrux commented 12 October 2008 at 00:55

hmm... this seems to be \x1A which is certainly invalid for XML:

http://www.w3.org/International/questions/qa-controls

However, these are filtered by sphinxsearch_get_node_text(), located in sphinxsearch.module. But... this is only applied to build the "content" attribute of the XMLPipe document, it is not applied to the "subject" attribute. See sphinxsearch_xmlpipe_document() located in sphinxsearch.xmlpipe.inc. And this would be a nice bug :)

If you can take a look at the contents of the file you got when running the XMLPipe command from shell, probably the \x1A character is located within the <subject> element of the document. Could you please confirm?

If that's correct, the you could try the following:

1) Open the file sphinxsearch.xmlpipe.inc.
2) Find the function sphinxsearch_xmlpipe_document().
3) You'll seee something like this:

$output .= '<subject><![CDATA[['. check_plain($node->title) .']]></subject>' ."\n";

4) Before that line, add the follwing:

$node->title = preg_replace('#[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]#S', ' ', $node->title);

Works?

If so, I'll post a patch and also create a new release for the D5 branch.

Comment #5

zeezhao commented 12 October 2008 at 08:22

Thanks for your reply. Yes, its in the element, and your fix works! The prorgam now progresses pass the node causing the error.

Since I am indexing 1.8million nodes, I will let you know if I find anything else that can help improve the program. My first run of 1.3 million nodes last week went fine though. It took 11 hours or so to run...

The only other comment: in sphinxsearch.common.inc, I found the line using node_load()

I changed it temporarily to: node_load($sphinx_match['attrs']['nid'], NULL, TRUE)))

so that it does not cache. Please comment on this. The reason I did this is that in node_load.module - another drupal module, I had to do something similar to conserve on memory...

Once again, thanks.

Comment #6

markus_petrux commented 12 October 2008 at 11:24

Great it worked. :) ...I'll try to pack a new release for the D5 branch, later today, if possible.

node_load() in sphinxsearch.common.inc is used to load nodes for the search results page, so there shouldn't be hundreds of nodes in memory. If you're concerned about this, please open a different issue.

As per indexer optimization... we already use node_load($nid, NULL, TRUE) in XMLPipe processing. Again, open a different issue if you wish to touch this subject. One suggestion from here: look at Drupal 6, in scripts directory, the scripy drupal.sh. It allows running Drupal from the CLI. This is something I discovered rencently, and I'll probably advice it's usage for the Drupal 6 version of this module. AFAICT, drupal.sh script is Drupal version agnostic, so you could get it from D6 and try it on your D5 installation.

Comment #7

markus_petrux commented 12 October 2008 at 11:24

Status:

Active

» Reviewed & tested by the community

Comment #8

zeezhao commented 13 October 2008 at 07:17

Thanks. I will leave node_load() as it.

I have also had further failures which look like as a result of the following character - ":"

The strange thing is the when I run the indexer for the same index and node on windows, it works. But when I run exactly the same on a linux box it fails...

Comment #9

markus_petrux commented 13 October 2008 at 07:39

Could you please generate a file with the result of redirecting the output of running the sphinxsearch_xmlpipe.php script with first/last nids set to the nid where it is failing?

I'll send you my email, so you can send me the file, and I try to see what happens on my system.

Comment #10

zeezhao commented 13 October 2008 at 09:35

Thanks for your reply. I've just emailed you.

I suspect it may also be memory related, as when I run for the node it seems to now work... But it does not when the node is within a 150k range.

Also, on reruning again on linux for the whole range, it now progresses a lit bit further and fails on another node entirely... It also as a ":" within the title. Error message type is ... "XML Parse error: no elemend found...".

So my guess is its either memory or the ":" character... But as per earlier post, I was able to run on windows.

Comment #11

markus_petrux commented 13 October 2008 at 09:51

Hi,

As I sent you from email: "I've checked the XML stream you've sent me, and it also parses correctly on my system."

So I also suspect this is something related with some kind of processing limit. Here, I would suggest:

1) Check watchdog logs (filtering for sphinxsearch reports). XMLPipe generator for main index processing tries to record one report at start, and another at end of process with information about execution time, and memory used.

2) Compare the time it takes your index generation process with the PHP settings defined on your system, and or the .htaccess file stored on the sphinxsearch_scripts subdirectory.

3) Try increasing max_executions_time, memory_limit, mysql.connect_timeout, ...

4) Define one main index with as much nodes as possible. If it fails, try limiting the number of nodes using the first/last arguments of the xmlpipe script. This should help you find out how many nodes can be processed with you current PHP settings. So you can increase them, or otherwise you'll need to use more main indexes with less nodes per index.

Comment #12

zeezhao commented 13 October 2008 at 12:46

Thanks for the tips.

I decided to increase memory limit and go with smaller batches of nodes. This seems to work so far...

Also, note that its the memory limit in sphinxsearch_scripts/.htaccess that actually took effect, not the one in sphinx.conf... It took a while to figure this out. The execution limits, etc were fine.

So it looks like a memory problem...

Comment #13

markus_petrux commented 13 October 2008 at 13:46

Here's one possible method that could help you compute how many main indexes you need, and how many time it will take to reindex them:

1) Create a main index for (say) 1000 nodes and run the indexer.
2) Look at statistics for that process stored in admin/logs/watchdog. This should tell you node processed per second.

Now, if we got it took 50 nodes per second, then we can process 3,000 nodes per minute, or 180,000 nodes per hour. If we have 1.8million nodes, then we need 10 hours to index all nodes in one single index. If we create 10 indexes, then we need 1 hour per index, or maybe more if we rebuild more than index in parallel, depending on resources available on the server where Sphinx is installed (number of physical discs, raid5?, memory, cpus, etc.).

HTH+

Comment #14

markus_petrux commented 29 October 2008 at 22:14

Status:

Reviewed & tested by the community

» Fixed

Fix is included in 5.x-1.3 :)

Comment #15

markus_petrux commented 2 November 2008 at 16:32

Status:

Fixed

» Closed (fixed)

indexer issue

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

Comment #15

News items

Our community

Documentation

Drupal code base

Governance of community