Thank you for creating a very good contributed module. I have been using various versions of this module's Drupal 6 code for more than two years now. Through periodic entries over that time, I have amassed about a thousand quotes on the Lars Toomre personal website. On the whole, this module has performed very well. However, over time I have noticed a couple of items that could be implemented to make it even better. (I consider myself an intermediate-level Drupal programmer, so I am happy to help as best I can.)

Over the past two-plus years, I have created content with about twenty quotes by Albert Einstein. Let me use this collection of quotes as an example of the general issues.

  • The first issue is rather simple. After some time, I sometimes do not recall whether I entered quote X previously. Rather than letting me save a duplicate of a quote that was entered previously, could some intermediate step be called to highlight that the new entry is a duplicate? Presumably this is part of the form validation process, but I am unsure how to implement it. With the right confirmation, one could also allow the entry to proceed anyway.
  • The second is a bit more complicated. I use the citation field in a cascading manner: if a citation is available, that is entered; more generally, the field is used to give a brief description of the quote author. In my use case, I almost always fill in the author field before I enter text in the citation field. However, the autocomplete function offers suggestions from all entries ever made in the citation field. Is it possible to limit the suggestions (at least at first blush) to only those citation entries previously made with that particular quote author? Ideally there would be an option to expand to the full set, as would be displayed for a new quote author.

Thanks again for a great module. I am happy to help to implement these UI changes and make the module even better for the Drupal community. Please let me know how I can help.

Cheers

Comments

ctmattice1’s picture

Lars, patches are always welcome.

Your first suggestion would require a db look-up of the actual quote. I've run into this many times and have thought about it; however, that look-up would also be resource intensive. It would also require that there be no typos in the new quote. So while this would be a neat feature, it would not be foolproof, though it could catch many of the duplicates most of the time.

The second suggestion is a bit simpler to implement, but more resource intensive: instead of running through all citations with autocomplete, it could first offer the citations previously used with that author and then, when those are exhausted, the complete set.
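If the autocomplete callback had the author value available, the author-first ordering could be sketched in plain PHP along these lines. The record structure and the function name here are illustrative assumptions, not part of the module:

```php
// Sketch: order citation suggestions so those previously used with the
// given author come first, followed by the complete set.
// $records is assumed to be rows of array('author' => ..., 'citation' => ...).
function citation_suggestions(array $records, $author) {
  $matching = array();
  $others = array();
  foreach ($records as $record) {
    if ($record['author'] === $author) {
      $matching[] = $record['citation'];
    }
    else {
      $others[] = $record['citation'];
    }
  }
  // Matching-author citations first, then the rest; duplicates removed.
  return array_values(array_unique(array_merge($matching, $others)));
}
```

In the real module the two groups would presumably come from two database queries rather than an in-memory sort, but the ordering idea is the same.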

Lars Toomre’s picture

Thanks ctmattice1 for your response. I am a bit confused about why checking for an exact match with an existing quote would be resource intensive. Could you elaborate a bit so that I better understand?

I agree that checking all of the existing quotes for the "close, but not exact" case (when no quote author has been specified) could be resource intensive. However, I do not imagine many users of this module have more than a hundred quotes from the same author, let alone thousands. What I am imagining the new code to do is something akin to the following (in the entry form validation section):

[A] First check - Query the database to see if there is an exact match with the body of an existing quote.

[B] Second check - Query the database to see if the new quote is sub-text of an existing quote. Warn the user if so.

[C] Third check - Naive attempt at closeness
1) Explode the new quote into an array of its component words. (May want to eliminate common words, but that brings in language complexities.)
2) If the quote author has been entered, select all quotes from that author; otherwise select all quotes.
3) For each quote in the selection from step 2), explode that quote into an array of component words.
4) Compute the percentage of words the new quote and the existing quote share in common.
5) If above a threshold percentage (say 80%), warn the user that quotes are very similar. (How should form processing then proceed?)
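As a rough sketch of steps 1) through 5) in plain PHP (the function names and the 80% threshold are illustrative assumptions on my part):

```php
// Naive closeness check: the share of the new quote's distinct words
// that also appear in an existing quote.
function quote_word_overlap($new_quote, $existing_quote) {
  // Lowercase and split on anything that is not a letter or digit.
  $new_words = preg_split('/[^a-z0-9]+/', strtolower($new_quote), -1, PREG_SPLIT_NO_EMPTY);
  $old_words = preg_split('/[^a-z0-9]+/', strtolower($existing_quote), -1, PREG_SPLIT_NO_EMPTY);
  $new_words = array_unique($new_words);
  if (count($new_words) == 0) {
    return 0.0;
  }
  $shared = array_intersect($new_words, array_unique($old_words));
  return count($shared) / count($new_words);
}

// Step 5: warn if the overlap is above a threshold (say 80%).
function quotes_are_similar($new_quote, $existing_quote, $threshold = 0.8) {
  return quote_word_overlap($new_quote, $existing_quote) >= $threshold;
}
```

Note this reads "percentage shared" as the fraction of the new quote's distinct words found in the existing quote; other readings (e.g. symmetric overlap) are possible.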

Does this make any sense? Your thoughts are welcome.

Cheers

NancyDru’s picture

Checking for an exact match could be fairly simple - just use an MD5 sum. However, that would not take into account slight variations (even a misplaced comma).
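A minimal sketch of the MD5 idea, using a hypothetical helper name of my own choosing:

```php
// Exact-match fingerprint: an MD5 sum of the lowercased, trimmed body.
// Any variation at all, even a misplaced comma, yields a different sum.
function quote_checksum($body) {
  return md5(strtolower(trim($body)));
}
```

Two bodies compare as duplicates exactly when their checksums are equal, so the check reduces to one indexed-column comparison in SQL.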

The problem with limiting the citation list is that the author field has not been presented to the application yet.

From #2, a single query (assuming the MD5 is implemented) could do both of the first two checks, although using a LIKE statement on a database with many, many quotes could be very slow. Yes, there is at least one user I know of with so many quotes that this module is already slow.

Your third check is prohibitively expensive, and even worse if one accounts for multiple languages. Effectively you are asking for a complete search implementation within this module. I don't see that ever happening. If you can find some algorithm that can do this quickly, we might consider it as an add-on.

You must also think about the process of entering multiple quotes at one time.

Lars Toomre’s picture

Thanks Nancy for your comments. I very much appreciate your thoughts and they have caused me to pause, think and investigate further. I had not considered the use case of entering multiple quotes at the same time nor did I understand when the author field gets presented to the application.

I probably will be quiet for a bit as I work on my own experimenting and trying to get some alpha code working in a fast and efficient manner. Any further comments or thoughts are welcome in the interim.

Cheers

Lars Toomre’s picture

During the past week, I have spent some time investigating and working on this issue, paying particular attention to Nancy's thoughts in #4.

A close review of the existing code found that no validation at all is performed on the individual quote import records. I would suggest, then, that the existing import code be modified so that some minimal level of checking is performed in a private quotes function, and that a module_invoke_all() call then be made so custom user modules can contribute further validation via a custom quotes hook definition.

Before working on a patch to the quotes.module file, I would like to get some thoughts on this approach. Thanks.

Lars Toomre’s picture

Checking a single quote entry for an exact match with those already in the database was fairly easy. Given that I am reluctant to modify any third-party Drupal code like the quotes.module file itself, I implemented this functionality via a helper module. The key was to add an #element_validate to the quote text field and then call a helper validation function.

My assumption is that a particular user will not want to enter an exact duplicate quote, and that this condition should be treated as an error. At present, the extraction of data from the form and the actual checking for a match are in the same function. If we want to add duplicate validation to the bulk import side, this function will need to be broken into two parts: 1) deal with $form_state I/O, and 2) perform the actual match check and set the error flag.

This second function could then also be called from the private validation function mentioned above. The question there will be what we want to do if a duplicate is found (skip it? how to highlight it? etc.). The resulting code can be tweaked into a patch against the quotes.module file if there is sufficient interest.

The relevant code from my helper module is:

/**
 * Implements hook_form_FORM_ID_alter() for quotes_node_form.
 *
 * Replace MYMODULE with the short name of your helper module.
 */
function MYMODULE_form_quotes_node_form_alter(&$form, &$form_state) {
  // Add an exact-match duplicate check to the quote body field.
  $form['body_field']['body']['#element_validate'][] = '_quotes_body_validate';
}

/**
 * Custom quote validation to check for an exact match with previously entered quotes.
 */
function _quotes_body_validate($element, &$form_state) {
  $quote = isset($form_state['values']['body']) ? $form_state['values']['body'] : '';
  $uid = isset($form_state['values']['uid']) ? (int) $form_state['values']['uid'] : 0;

  // Use db_query() placeholders rather than string concatenation so the
  // values are properly escaped.
  $sql = "SELECT n.nid, n.title " .
    "FROM {node} n INNER JOIN {node_revisions} nr ON n.vid = nr.vid " .
    "WHERE n.type = 'quotes' AND n.uid = %d " .
    "AND MD5(LOWER(nr.body)) = '%s' " .
    "ORDER BY n.created DESC";
  $records = db_query($sql, $uid, md5(strtolower($quote)));
  while ($record = db_fetch_object($records)) {
    // Skip this quote if it has already been saved once (i.e. update mode).
    if (isset($form_state['values']['nid']) && $form_state['values']['nid'] == $record->nid) {
      continue;
    }
    $link = l($record->title, 'node/' . $record->nid, array('attributes' => array('target' => '_blank')));
    form_error($element, t('This quote is a duplicate of the existing quote !link.', array('!link' => $link)));
  }
}

As always, thoughts and comments are welcome. Cheers!

Lars Toomre’s picture

A possible failing with the code offered in comment #6 concerns filtering. Drupal has a great policy of never changing user-entered content, so we probably need to deal with filtering here as well. For instance, I sometimes embed URL links within a posting (although generally of the 'blog' type).

Suppose that one is dealing with the quote "Wise men put their trust in ideas and not in circumstances." Occasionally, I will embed a link around a term like 'trust' that connects to some specific external page about this quote/term. On other quote pages, I might use the characters '--' or '—' to represent a long dash in the middle of a quote. The current code assumes that one is dealing with "pure" text, both in what has been entered for the current quote and in what is stored in the database. Hence, the same quote with a link and without one would be considered different. That is obviously wrong.

A probable improvement to the validation routine involves "cleaning up" both the stored values and the new quote text string before the md5 comparison is conducted. The clean-up function would filter out all HTML links as well as rationalize dashes and the like. Obviously, this type of function will add overhead per comparison.
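A minimal sketch of such a clean-up function in plain PHP; the exact set of normalizations (links, dashes, entities, whitespace) is my assumption:

```php
// Normalize a quote body before fingerprinting: strip HTML links and
// other tags, rationalize long dashes, collapse whitespace, lowercase.
function quote_normalize($text) {
  // Drop HTML tags such as embedded <a href="...">...</a> links.
  $text = strip_tags($text);
  // Rationalize the various long-dash spellings to a plain hyphen.
  $text = str_replace(array('&mdash;', '—', '--'), '-', $text);
  // Decode any remaining entities, collapse whitespace, and lowercase.
  $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
  $text = preg_replace('/\s+/', ' ', $text);
  return strtolower(trim($text));
}
```

The duplicate check would then compare md5(quote_normalize($body)) on both sides, so a quote with an embedded link and one without would fingerprint identically.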

My inclination is to implement this more compute-intensive preprocessing method before the comparison is actually performed. Any thoughts before I proceed?

Cheers!

Nota Bene - This issue came up while pre-processing the quote to get an accurate word count. Obviously, the 'mdash' there should not be counted as a word!

Lars Toomre’s picture

This morning I started by cleaning up some open @todo items. One of those concerned breaking out the node-comparison functionality so that it can also be used in the batch import portion of this module (from #6 above). I am posting the resulting code so that others might use it as well.

/**
 * Custom quote validation to check for an exact match with previously entered quotes.
 */
function _quotes_body_validate($element, &$form_state) {
  $nid = isset($form_state['values']['nid']) ? $form_state['values']['nid'] : 0;
  $quote = isset($form_state['values']['body']) ? $form_state['values']['body'] : '';
  $uid = isset($form_state['values']['uid']) ? (int) $form_state['values']['uid'] : NULL;

  $duplicates = check_for_node_duplicates($nid, $quote, $uid, 'quotes');
  foreach ($duplicates['nodes'] as $error_message) {
    form_error($element, $error_message);
  }
}

/**
 * Checks for duplicate nodes.
 *
 * Returns an array with 'number' (the count of duplicates found) and
 * 'nodes' (error messages keyed by the duplicate node's nid).
 */
function check_for_node_duplicates($nid, $body, $uid = NULL, $type = NULL) {
  $duplicates = array('number' => 0, 'nodes' => array());

  // Build the query with db_query() placeholders so values are escaped.
  $sql = "SELECT n.nid, n.title " .
    "FROM {node} n INNER JOIN {node_revisions} nr ON n.vid = nr.vid " .
    "WHERE MD5(LOWER(nr.body)) = '%s' ";
  $args = array(md5(strtolower($body)));
  if ($uid !== NULL) {
    $sql .= "AND n.uid = %d ";
    $args[] = $uid;
  }
  if (!empty($type)) {
    $sql .= "AND n.type = '%s' ";
    $args[] = $type;
  }
  $sql .= "ORDER BY n.created DESC";
  $records = db_query($sql, $args);
  while ($record = db_fetch_object($records)) {
    // Skip the node currently being edited (i.e. update mode).
    if ($nid == $record->nid) {
      continue;
    }
    $link = l($record->title, 'node/' . $record->nid, array('attributes' => array('target' => '_blank')));
    $duplicates['nodes'][$record->nid] = t('This text is a duplicate of the existing node !link.', array('!link' => $link));
    $duplicates['number']++;
  }
  return $duplicates;
}

NancyDru’s picture

In all likelihood, this needs to be a separate (helper) module so that it doesn't impact performance for those who don't want it (this is supposed to be a relatively light module, after all). There was no validation before because none was needed. As for the duplicate flagging, I'd really like to see a form_error raised so that the duplicate doesn't get entered at all - not even for bulk imports. I would also suggest a setting that allows choosing between simple and complex checking, as most people are going to enter simple text.

Lars Toomre’s picture

Nancy,

Thanks for your comments. For me at least, there appears to be no performance hit doing a simple text comparison (running against about 1,200 existing quotes). There will surely be more of a performance hit once I get the "filtered" text comparison code implemented, though I have no sense yet of how large it might be.

The way my code works now, a form_error is raised for a single quote entry that is a duplicate. I am not presently doing any validation on bulk imports because I am uncertain how errors should be handled there. Say a duplicate is detected on the xth of y entries: should the import process stop with that quote? Back out the quotes already processed? Continue to the end of the file without storing the duplicate?

Thoughts are welcome.

Cheers!

NancyDru’s picture

Currently, if an error is detected (say, in delimiters), the process stops and the entries that were already processed remain in the db. Not pretty, but easier to deal with. I'm not sure an issue has ever been opened against bulk import, other than mine to change it. The only thing I've ever used it for is export/import to another site, and I don't like that process at all.