This forum is for less technical discussions about the Drupal project, not for support questions.

new module in development: scraping data from web sites & importing it as nodes

Well, for the record: I finished the first prototype of a new module, "scraper", in November, but have not contributed it yet. Currently you run it on a test/dev site and then import the data into your production site. Requires: Drupal 4.7, PHP 5, and the Tidy extension.

Here's how it works.
(1) You define the dynamic content you want to import into your website. Maybe it's prices of some products. Maybe it's weather data. Maybe it's your prospective fiancee's evolving criminal record.
(2) You create a new "scraper job". Using a combination of config settings, XPath and PHP scripting, you encode this job w/ all info to get the desired data.
(3) The output of this module is currently a CSV file (though I could easily spit out some flavor of XML or whatever).
(4) You print out the CSV file, sit on it, and spin. Or you use node_import to import the data as nodes. Whatever you want to do with it (as long as it's legal).
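To make steps (2) and (3) concrete, here is a rough sketch of what a "scraper job" could look like. It is written in Python purely for illustration and is not the module's code: the page content, the `JOB` structure, and the function names are all assumptions. It assumes the page has already been cleaned to well-formed XHTML (the role Tidy plays above), applies XPath expressions to pull out records, and writes them as CSV.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical page content, already cleaned to well-formed XHTML
# (the job Tidy performs in the module described above).
PAGE = """<html><body>
<table id="prices">
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
</table>
</body></html>"""

# Hypothetical "scraper job": one XPath selecting each record, plus one
# relative XPath per output column.
JOB = {
    "rows": ".//table[@id='prices']/tr",
    "fields": {
        "title": "td[@class='name']",
        "price": "td[@class='price']",
    },
}


def run_job(page, job):
    """Apply the job's XPath expressions and return a list of row dicts."""
    root = ET.fromstring(page)
    records = []
    for row in root.findall(job["rows"]):
        record = {}
        for column, xpath in job["fields"].items():
            cell = row.find(xpath)
            # A missing cell becomes an empty CSV field rather than an error.
            record[column] = (cell.text or "").strip() if cell is not None else ""
        records.append(record)
    return records


def to_csv(records, columns):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = run_job(PAGE, JOB)
csv_text = to_csv(records, list(JOB["fields"]))
print(csv_text)
```

The resulting CSV has one column per field in the job, which is the shape an importer such as node_import would then map onto node fields.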

I have this sucker working for all of my ~dozen test sites. It has some pretty slick capabilities:
* ability to post form info (logins, search forms, etc),
* ability to read page/form info and traverse multiple pages of data,
* ability to recursively call other scraper jobs, to get data that is spread across multiple pages,
* ability to use regular expressions (& some regex helper functions) to extract e.g. phone numbers or dates out of a text field.
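As an illustration of the last capability, the following Python sketch pulls phone numbers and dates out of a free-text field. The patterns and helper names are assumptions made for this example (US-style ten-digit phone numbers and ISO `YYYY-MM-DD` dates), not the module's actual helper functions.

```python
import re

# Assumed formats: (555) 123-4567, 555-123-4567, 555.123.4567
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})\b")
# Assumed format: ISO dates such as 2006-01-15
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_phones(text):
    """Return every phone number found, normalized to NNN-NNN-NNNN."""
    return ["-".join(match.groups()) for match in PHONE_RE.finditer(text)]


def extract_dates(text):
    """Return every ISO-formatted date found, in order of appearance."""
    return DATE_RE.findall(text)


field = "Call (555) 123-4567 or 555.987.6543 before 2006-01-15; updated 2005-11-30."
phones = extract_phones(field)
dates = extract_dates(field)
print(phones)
print(dates)
```

Normalizing the matches at extraction time keeps the CSV output consistent even when the source site formats the same kind of value several different ways.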

I have worked with & reverse-engineered 3 commercial products in this space, and used these learnings in this product. But it is still in pre-beta mode.

Search Engine Optimization beyond clean URLs and pathauto

I'm wondering what other aspects of our Drupal site we can fine-tune for search engine optimization.

We have clean URLs set up. We have the pathauto module set up to create URL aliases based on the node title. We've tweaked the page title settings to display the headline of a node as the page title.

Beyond this, are there best practices for Drupal and SEO?

For example, would one suggest modifying page.tpl.php (or template.tpl.php) so that the content portion of a page appears above the sidebar blocks in the HTML source?

Recruiting

I am looking for about 5 people to help me create and maintain a Drupal help site to help Drupal users out, as well as give out guides, modules, and templates. If you are interested, please e-mail me at capnmacy@yahoo.com. Thanks.

Is there a module for this?

Is there a module for this, or rather, how would I go about implementing a format translation where wp:[this and that] would translate to a Wikipedia search for 'this and that'?

Better still, how could Firefox's smart keywords be used as a template for easy linking?
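One way to picture the translation being asked about is a text filter driven by a keyword table, in the spirit of Firefox's smart keywords, where `%s` in a URL template stands for the search terms. The sketch below is in Python for illustration only and is not an existing Drupal module; the `KEYWORDS` table, the function name, and the choice of Wikipedia search URL are assumptions.

```python
import html
import re
from urllib.parse import quote_plus

# Hypothetical keyword table: keyword -> URL template, with %s marking
# where the search terms go (the smart-keyword convention).
KEYWORDS = {
    "wp": "https://en.wikipedia.org/wiki/Special:Search?search=%s",
}

# Matches tokens of the form keyword:[search terms]
TOKEN_RE = re.compile(r"\b([a-z]+):\[([^\]]+)\]")


def expand_keywords(text, keywords=KEYWORDS):
    """Replace keyword:[terms] tokens with HTML links; leave unknown keywords alone."""

    def replace(match):
        keyword, query = match.group(1), match.group(2)
        template = keywords.get(keyword)
        if template is None:
            return match.group(0)  # not a configured keyword: keep the text as typed
        url = template.replace("%s", quote_plus(query))
        return '<a href="%s">%s</a>' % (html.escape(url), html.escape(query))

    return TOKEN_RE.sub(replace, text)


linked = expand_keywords("See wp:[this and that] for background.")
print(linked)
```

Adding another shortcut is then just another entry in the table, which is what makes the smart-keyword format a convenient template.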

Primary & secondary links in comparison to a menu module ...

Hi,

I am still checking out Drupal and I have noticed the primary and secondary link rows.
It seems to me that this is just a horizontal menu, while the menu module will show links in a block ...
Can anyone tell me why the primary and secondary links are available when a menu could be shown by a module of choice?

GDev

Integration with other php projects

I've got another Drupal site to work on, but this one might also have a PHP bulletin board or wiki as well. So I have a few questions regarding integration.

1) URLs
If I want to use clean URLs on Drupal and the bulletin board, how do I avoid URL conflicts? Basically I need to make sure there's no node ever created at the path /forum/. Can I create that node and have it redirect? Can I somehow put that path on reserve?

Also when creating menu items it says the paths should be relative to the Drupal installation, but what if I need a primary link or a menu item to point to something outside of Drupal, or another site altogether?

2) Should I just use Drupal for everything?
I'm considering using wiki and bulletin board software that's much more mature and full-featured than Drupal's forum or wiki capabilities (the latter isn't out for 4.7 yet). I'm not really sure of the pros or cons of that choice. Obviously an all-Drupal solution would be easier to launch, maintain, etc. But would, say, MediaWiki and phpBB give me a lot more capability, especially as the usage grows?

3) How can I create a common theme?
Can I get everything to use common CSS files? I've noticed that some other PHP packages use tpl files, but are these the same as what Drupal has? Does anyone know if it would be extremely difficult to get them all producing the same kind of HTML and class names so that they can share CSS?
