Yesterday, I installed the Alchemy module along with the Autotagging module recommended here.

Loving it already: the tags returned are at least as relevant as (if not more relevant than) those hand-picked by my editors, and obviously generated in a fraction of the time.

One issue I have (and I'm not sure whether it lies with Alchemy or Autotagging) is that some of the returned tags contain HTML entities.

I have the threshold set to 60% relevancy and I'm getting back tags such as "target=", "keyword", "Keyword ", etc.

Wouldn't it make sense to have a checkbox that can be ticked to strip HTML entities from all text before sending it off to the Alchemy servers?

Comments

thedavidmeister’s picture

oh whoops, the html that i was trying to quote came through :P

and nbsp have caused me troubles so far.

thedavidmeister’s picture

simple fix:

line 77, just under

<?php
function alchemy_get_elements($text, $type = 'keywords', $output = 'normalized', $cid = 0, $use_cache = FALSE) {
?>

add this:

<?php
$text = strip_tags( $text );
?>
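
To illustrate what that one-line change covers and what it leaves behind, here is a minimal standalone sketch. The sample string is made up for illustration and is not taken from the module; only the standard PHP function `strip_tags()` is used.

```php
<?php
// Hypothetical sample input, not taken from the Alchemy module.
$text = '<p>Read the <a href="/news" target="_blank">latest&nbsp;news</a> &amp; updates</p>';

// strip_tags() removes the markup, including attributes such as target="_blank".
$stripped = strip_tags($text);

// Entities such as &nbsp; and &amp; are not tags, so they are left untouched.
echo $stripped, "\n";
```

So the tag fragments go away, but entities would still need a separate step if they are also unwanted.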

thedavidmeister’s picture

Status: Active » Needs review

could someone review #4 so we can get it committed?

i find it hard to see how such a simple change could break anything, but someone else might know something i don't about that PHP function.

TomDude48’s picture

I added it, let me know if it is doing what you needed.

TomDude48’s picture

Status: Needs review » Fixed

thedavidmeister’s picture

Status: Fixed » Needs work

it is mostly doing what i need, but characters like slashes and html entities are still being sent.

i expanded it to this:

$text = strip_tags( $text );
$text = html_entity_decode( $text, ENT_QUOTES, 'UTF-8' );
$text = str_ireplace( '&nbsp;', ' ', $text );

and that got most things that were still annoying me, but when i tried to use a regex to strip everything except "word characters" and white space alchemy sent an error.

now that i think about it, that could have had something to do with me sending a 1000 word "sentence" to alchemy after i stripped out all the punctuation.

anyway, i ran out of time to work on this as it is working fine for 90% of our articles with those three lines i posted above, but if you want to take it further you definitely could.

btw, we've noticed that the quality and relevance of the tags returned by Alchemy in general are vastly improved when you send it plain text with simple punctuation rather than html.
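
For anyone who wants to take it further, here is a rough sketch of how the three lines above could be extended while keeping simple punctuation, so the text is not flattened into one very long "sentence". The function name `example_plain_text()` and the list of punctuation kept are assumptions for illustration, not part of the Alchemy module. One detail worth noting: once `html_entity_decode()` has run with UTF-8, `&nbsp;` has become the non-breaking space character (U+00A0), so the sketch replaces that character as well.

```php
<?php
/**
 * Hypothetical helper, not part of the Alchemy module: reduce HTML to
 * plain text with simple punctuation.
 */
function example_plain_text($html) {
  // Remove markup first, then turn entities into real characters.
  $text = strip_tags($html);
  $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

  // Decoded &nbsp; is U+00A0 (bytes C2 A0 in UTF-8); use a normal space.
  $text = str_replace("\xC2\xA0", ' ', $text);

  // Keep letters, digits, whitespace and basic punctuation; replace
  // anything else (slashes, stray angle brackets, ...) with a space.
  $text = preg_replace('/[^\p{L}\p{N}\s.,;:!?\'"()-]+/u', ' ', $text);

  // Collapse runs of whitespace and trim the ends.
  return trim(preg_replace('/\s+/u', ' ', $text));
}

echo example_plain_text('<p>Caf&eacute;&nbsp;culture: it&#039;s <b>great</b>!</p>'), "\n";
```

Keeping sentence punctuation is a judgment call based on the observation above that plain text with simple punctuation gives better tags; the character class can be tightened or loosened as needed.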

TomDude48’s picture

Status: Needs work » Needs review

fixed in latest commit (in dev branch)

technologywon’s picture

Issue summary: View changes
Status: Needs review » Closed (outdated)

Drupal 6 is no longer supported