Features that may (or may not) be interesting:

  • Collecting weblinks when posts are submitted: is it possible to collect URLs from submitted posts and add them to some sort of moderation queue?
  • Collecting feeds when weblinks are submitted: when a weblink is submitted, we should be able to check whether that website has a feed and submit that feed through feedapi.

This could help to gather weblinks and retrieve content through feeds related to the content users submit (it takes way too long to do all of this by hand; I've tried it once).

Attachment (#23): weblinks_collector.module.zip, 6.99 KB, by sevanden

Comments

NancyDru’s picture

I would think that #1 could be accomplished through a filter that is basically a clone of the core URL filter. It could be provided as an add-on to Web Links so those that don't want it don't have to use it. However, the URL filter doesn't catch links that are in anchor text, so that part would need to be extended. I would then envision saving it as an unpublished weblinks node that the admin could see with the Unpublished block or page.

I know nothing about feeds, so I would not be able to do #2, but perhaps Rmiddle understands them.

sevanden’s picture

Thanks for the quick reply.

I'm not a PHP programmer yet, but I've been checking the web learning about feeds and reading up on Drupal development; unfortunately I still need to learn a lot before I'd be able to contribute something. I'm playing around with this idea right now; should something positive come out of it, I'll let you know. I'll keep an eye out here, just in case you're faster than me (which will probably be the case).

Of course you're right that those links should not be published automatically and should be reviewed by a moderator first; the same would go for feeds too.

Keep up the good work, I like this module.

rmiddle’s picture

The funny thing is I was thinking the 1st one would be hard but the 2nd one wouldn't be that hard.

Thanks
Robert

sevanden’s picture

I've been playing around a little and already have some code that scans a post when it is being saved and extracts URLs from it.
Next I need to figure out how to retrieve the description of the site and how to submit it as a weblinks node.
I must say this seems to be a rather slow-running thing; it takes a while to complete.

So, once I have collected all the data needed to create a weblink, how do I actually have it submitted as a 'weblinks' node?
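
A minimal sketch of that scanning step, for reference; it assumes a Drupal 5/6-style hook_nodeapi() implementation, the function names are hypothetical, and core's _filter_url() handles far more URL forms than this simple pattern:

/**
 * Sketch: extract URLs from a node body when it is inserted.
 */
function weblinks_collector_nodeapi(&$node, $op) {
  if ($op == 'insert') {
    // Deliberately simple pattern; the core URL filter covers more cases.
    if (preg_match_all('!https?://[^\s<>"\']+!i', $node->body, $matches)) {
      foreach (array_unique($matches[0]) as $url) {
        // Hand each unique URL off to the collection/moderation step.
        weblinks_collector_process_url($url);
      }
    }
  }
}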

NancyDru’s picture

Well, Sebastiaan, the first step would be to see if it already exists by checking the "weblinks" table. Then build the data with all of the appropriate fields and call node_submit. For an example of doing something very similar, take a look at the Faq_Ask module, which does this same type of thing (even the unpublished part) for the FAQ module.

I shouldn't think it would be that slow. Have you looked at the core URL Filter code?
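
For the "does it already exist" check, a sketch like this might do, assuming the weblinks table stores the link in a column named url:

/**
 * Sketch: TRUE if the URL is already recorded in the weblinks table.
 */
function weblinks_collector_url_exists($url) {
  return (bool) db_result(db_query(
    "SELECT COUNT(*) FROM {weblinks} WHERE url = '%s'", $url));
}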

NancyDru’s picture

It would be nice if publish/unpublish were an option.

And extra kudos if there is an option for replacing the link with a weblinks filter entry.

sevanden’s picture

I'm not there yet; I don't even have an admin form yet ...

But I'm a little stuck on this part for now; it's being executed during 'insert' with nodeapi.

$node = array('type' => 'weblinks');
$values['title'] = $weblink['title'];
$values['url'] = $website;
$values['body'] = $weblink['description'];
$values['weight'] = '0';
// Note: the node form's author textfield is named 'name', so 'author' is likely ignored.
$values['author'] = 'root';
// Likely culprit: for node forms, drupal_execute() expects the node (type) array
// as the extra argument, i.e. drupal_execute('weblinks_node_form', $values, $node).
drupal_execute('weblinks_node_form', $values, $values);
$errors = form_get_errors();

I get an error when a story with links is being submitted: The post could not be saved.

Still have lots of code cleanup to do; it's not as easy as I had hoped (maybe because I had never written a single line of PHP code before).

NancyDru’s picture

Hmm. I hadn't considered using nodeapi - but don't forget "update" (a node could have a URL added after it is originally created). If you use filter code, an admin form is 80% complete for you. As a matter of fact, a major portion of the whole module is available just by copying the URL filter.

I'm not a big fan of the "drupal_execute" method. I prefer a simple "node_submit," which I think is more straightforward and easier to use. I think it is also less susceptible to core changes.

NancyDru’s picture

/**
 * Function to build a Web Links node.
 */
function weblinks_collector_make_node($link, $unpublish) {
  global $user;
  $node = array(
    'type' => 'weblinks',
    'body' => $link['body'],
    'title' => $link['title'],
    'created' => time(),
    'uid' => $user->uid,
    'name' => $user->uid ? $user->name : variable_get('anonymous', t('Anonymous')),
    'status' => !$unpublish,
    'format' => variable_get('weblinks_format', FILTER_FORMAT_DEFAULT),
    'comment' => variable_get('comment_weblinks', 0),
    );
  $node_options = variable_get('node_options_weblinks', array('status', 'promote'));
  foreach (array('promote', 'sticky', 'revision') as $key) {
    $node[$key] = in_array($key, $node_options) ? 1 : 0;
  }

  // Okay, let's get it done. Node_submit will prepare it and make it an object.
  $node = node_submit($node);
  node_save($node);

  drupal_set_message(decode_entities(t("Created Web Links: %link.", array('%link' => l($node->title, 'node/'. $node->nid)))));
}
NancyDru’s picture

Sebastiaan, please contact me through my contact form.

I did sit down and spend a little time coding this as a filter module. Now that it works, I discovered that it is the wrong way to approach it. (However, the code may be helpful to you.) Probably your hook_nodeapi approach is better.

There are some other issues as well.

Drupal guidelines call for filtering on OUTPUT (and you are filtering however you do it). This is probably a case where that rule needs to be violated, as I will describe.

The big problem with my filter technique is that every time the cache is cleared and that content is viewed, it will be re-filtered. This means the collection will happen all over again, potentially resulting in many duplicate links. When is the cache cleared? Some examples you might not suspect: every time a node is created or deleted, any time the taxonomy is altered, and any time you run update.php.

The only ways I can see to avoid this are either to keep another table indicating which nodes have been processed, or to actually modify the content on input (violating the guidelines). The latter may very well confuse (or anger) the users if they go back and edit the node later.
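
A sketch of the tracking-table idea; the weblinks_collector_processed table is hypothetical and would need a schema with at least a nid column:

/**
 * Sketch: remember which nodes have already been scanned.
 */
function weblinks_collector_is_processed($nid) {
  return (bool) db_result(db_query(
    "SELECT COUNT(*) FROM {weblinks_collector_processed} WHERE nid = %d", $nid));
}

function weblinks_collector_mark_processed($nid) {
  db_query("INSERT INTO {weblinks_collector_processed} (nid) VALUES (%d)", $nid);
}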

Having said all that, I still like the idea. It just needs more thought and a lot of testing.

sevanden’s picture

You're absolutely right about the filtering issue; the code would run over and over again. I think the best way to avoid this is to scan the node when it has been submitted (and has passed validation), which is why I would prefer to use a hook_nodeapi implementation instead.

I've also noticed the duplicate link issue and I'm adding additional checks to my code for it (though if I'm not mistaken, the Web Links module has some checks for that already). Aside from that, I think we need to identify the front page of each site's URL if we want to keep things clean.

I'm currently still working on some code to retrieve metadata from the websites, which might prove useful for filling in the description and title fields of the link. I was even thinking about retrieving keywords and having those suggested as terms for the link, but that's for much later.
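
A sketch of that metadata step, using core's drupal_http_request(); the function name is hypothetical and the parsing is deliberately naive:

/**
 * Sketch: fetch a page and pull out its title and meta description.
 */
function weblinks_collector_fetch_meta($url) {
  $meta = array('title' => $url, 'description' => '');
  $response = drupal_http_request($url);
  if (isset($response->code) && $response->code == 200) {
    if (preg_match('!<title[^>]*>(.*?)</title>!is', $response->data, $m)) {
      $meta['title'] = trim(decode_entities($m[1]));
    }
    if (preg_match('!<meta\s+name=["\']description["\']\s+content=["\'](.*?)["\']!is', $response->data, $m)) {
      $meta['description'] = trim(decode_entities($m[1]));
    }
  }
  return $meta;
}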

I'm also doing my best to document my code as thoroughly as possible.

I was rather surprised how quickly a simple idea can turn into such complex code (my first time programming in PHP and my first attempt to contribute to Drupal).

sevanden’s picture

I've been thinking a little more about this little project and came to the conclusion that this module might take up too much running time. If someone has the workflow for stories set to publish and feeds set to publish, then for every node created by feeds this module would run through all of its code. So I think I should rewrite my code to use the job_queue module and have several parts execute at cron runs instead of entirely at the moment a node is inserted into the database. This would be especially true if the module also collects feeds and those feeds are refreshed on creation ...

Am I seeing this right?
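
Roughly what that deferral could look like, reworking the earlier nodeapi sketch; this assumes the job_queue module's job_queue_add($callback, $description, $arguments) and a hypothetical scan callback:

/**
 * Sketch: queue the scan at save time instead of running it inline.
 */
function weblinks_collector_nodeapi(&$node, $op) {
  if ($op == 'insert' || $op == 'update') {
    // The actual scan runs later, during cron, via job_queue.
    job_queue_add('weblinks_collector_scan_node',
      t('Scan node for weblinks'), array($node->nid));
  }
}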

NancyDru’s picture

Yes, you could be right. You could always set it so that only certain content types are checked. If you do elect to do this in hook_cron, you must be careful not to run over the maximum_execution_time limit, which for many people is only 30 seconds. In the filter module I sent you, the time it took to run was not noticeable, even when creating two other nodes.
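
If you roll your own queue instead of using job_queue, one way to respect that limit is a simple time budget inside hook_cron(), sketched here with a hypothetical weblinks_collector_queue table and scan callback:

/**
 * Sketch: process queued nodes on cron without exceeding the PHP time limit.
 */
function weblinks_collector_cron() {
  $limit = (int) ini_get('max_execution_time');
  // Use at most half the PHP limit, capped; 0 means no limit (e.g. CLI).
  $budget = $limit > 0 ? min(15, $limit / 2) : 15;
  $start = time();
  $result = db_query("SELECT nid FROM {weblinks_collector_queue}");
  while (($row = db_fetch_object($result)) && (time() - $start) < $budget) {
    weblinks_collector_scan_node($row->nid);
    db_query("DELETE FROM {weblinks_collector_queue} WHERE nid = %d", $row->nid);
  }
}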

sevanden’s picture

I have something working now (http://monkeyboy.plesk3.freepgs.com); I still need to fix some code for redirects (there seem to be a few irregularities) and make some additions for feeds, but I'm making progress.
I'm not really sure how to handle the taxonomy problem yet; it's not so simple to categorize links or feeds that are collected.
I was also wondering whether it might be a good idea to include support for nodewords based on the keywords found.

NancyDru’s picture

I would guess that most people will want to moderate collected links, so they can categorize them at that time. BTW, how about assigning this issue to yourself?

sevanden’s picture

I'm still working on this idea. I've also come to the conclusion that it would be a good idea to have a blacklist against which to verify collected links (nobody wants to collect junk), and I think this may even prove useful for the main weblinks module. I'll have to think about the best way to do this; suggestions are always welcome.

The module works in such a way that you can control the workflow of the collected weblinks; perhaps it would also be a good idea to make this work together with the modr8 module. I'll look into this later ...

I'm currently working on the feeds part, which I want to follow the same workflow idea as the links. But I'm also thinking of only collecting feeds after a weblink has been approved ... If someone has experience with feedapi, I could use some suggestions on that too.

NancyDru’s picture

Well, two ideas spring to mind right off the bat:
1) Add a flag to the weblinks table to indicate the link is blacklisted (sketched below).
2) Add a term to the vocabulary for blacklisting.
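
Idea 1 might look like this, assuming a tinyint blacklist column is added to the weblinks table:

/**
 * Sketch: TRUE if a collected URL matches a blacklisted entry.
 */
function weblinks_collector_is_blacklisted($url) {
  return (bool) db_result(db_query(
    "SELECT COUNT(*) FROM {weblinks} WHERE url = '%s' AND blacklist = 1", $url));
}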

As I understand it, working with modr8 is simply a matter of specifying the "moderate" flag when saving the node.
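
If so, it would just be one more element in the $node array from the weblinks_collector_make_node() sketch above (assuming modr8 reads the core 'moderate' node flag, as described):

// Flag the new node for modr8's moderation queue.
$node['moderate'] = 1;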

Feeds are beyond me.

sevanden’s picture

Thanks for the quick reply.

I've been cleaning up the code I have at this time and bug-hunting. Once that is done and I find it to work within acceptable limits, the package will be made available for download. I tried to apply for a CVS account, but was declined because there was no download link ...

The feeds part I think I have figured out, at least a very small portion of it ... but it works while respecting the default workflow of the feed node type.

Anyway, I hope that once it's available here I'll get some more ideas and feedback to improve this little spider.

NancyDru’s picture

You can always send it to one or both of the project maintainers for testing and inclusion with the module.

NancyDru’s picture

Status: Active » Postponed

sevanden’s picture

While cleaning up my mailbox I found this again ... since I have not worked on this any further and some people have asked about the code, here it is (the weblinks_collector.module.zip attachment at the top of this issue).