Rule: ScrapingBot Crawler (AI Interpolator ScrapingBot)
Base data:
Summary:
The ScrapingBot Crawler takes a link field, scrapes that webpage through the ScrapingBot service, and extracts the HTML the user asks for. It can dump the whole page, fetch specific DOM elements, or use the Readability library to work out which part is the main text.
If you only scrape server-side rendered webpages from your own computer or server, look into AI Interpolator Simple Crawler, which does the same thing for free.
Module needed:
AI Interpolator ScrapingBot
Field types to populate:
Text (plain, long) field (core field).
Text (formatted, long) field (core field).
Base Fields types to use as context:
- Link
Extra Requirements:
You need a ScrapingBot account.
If you want, you can use the code DRUPALAI for 20% off the price of the first month. Use of this code helps pay for testing and development of the module.
Extra Settings:
None
Extra Advanced Settings:
Use Chrome
This makes sure the page is loaded in a real browser instead of by a plain network scraper that does not render the website.
Wait for Network
Check this if you want to wait for most AJAX requests to finish before the HTML content is returned when using Chrome. This can slow down your scraping, or make it fail, if some requests never end.
Proxy Country
Set the country to proxy the request from. This is very useful, for instance, if you are scraping American websites that do not adhere to GDPR and simply block visitors instead.
Use Premium Proxy
Uses a Premium Proxy to scrape websites that detect and block server IPs. For instance, use this if you spider Rakuten or Netflix. Note that this costs 10 times as many credits, or 20 times as many when JS rendering is on.
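As a rough illustration of how the four network settings above fit together, the sketch below collects them into a single request body. It is not the module's code (the module itself is a Drupal module), and the key names `useChrome`, `waitForNetworkRequests`, `proxyCountry` and `premiumProxy` as well as both function names are assumptions chosen to mirror the setting labels; check the ScrapingBot API documentation before relying on them. Rejecting Wait for Network without Chrome is likewise a design choice of this sketch, based on the note that the setting applies when using Chrome.

```python
import json


def build_scrape_options(use_chrome=False, wait_for_network=False,
                         proxy_country=None, premium_proxy=False):
    """Collect the advanced settings into one options dictionary.

    The key names are assumptions that mirror the setting labels.
    """
    # Wait for Network is described as a Chrome-only setting, so this
    # sketch refuses the combination instead of silently ignoring it.
    if wait_for_network and not use_chrome:
        raise ValueError("Wait for Network only applies when Use Chrome is on")

    options = {
        "useChrome": use_chrome,
        "waitForNetworkRequests": wait_for_network,
        "premiumProxy": premium_proxy,
    }
    # Only send a proxy country when one has been chosen.
    if proxy_country:
        options["proxyCountry"] = proxy_country.upper()
    return options


def build_request_body(url, **settings):
    """Serialize the link to scrape and its options as JSON."""
    return json.dumps({"url": url, "options": build_scrape_options(**settings)})
```

For example, `build_request_body("https://example.com", use_chrome=True, proxy_country="us")` produces a JSON string with the URL and an options object whose `proxyCountry` is `US`.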
Crawler Mode
This decides which parts of the HTML to get from the link. The options are:
- Raw Dump - this takes everything, including metadata outside of the body.
- Article Segmentation (Readability) - this tries to extract the article text from an article and keeps only those HTML elements.
- HTML Selector - this uses DOM traversal to find the given tag and strips the tags you list for removal. (BETA - will be replaced with selectors)
Tag to get
Only used if HTML Selector is chosen. Add the tag to get, for instance body.
Tags to remove
Only used if HTML Selector is chosen. Add the tags to remove. Script and style tags, for instance, are usually uninteresting and might even break the HTML rendering.
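To make the HTML Selector mode more concrete, here is a minimal sketch of the idea behind Tag to get and Tags to remove, written with Python's standard `html.parser`. It is not how the module is implemented; the class `TagExtractor` and the function `select_html` are hypothetical names, and the defaults (`body`, `script`, `style`) are taken from the examples in the setting descriptions.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagExtractor(HTMLParser):
    """Keep the markup inside one tag and drop unwanted tags within it."""

    def __init__(self, tag_to_get, tags_to_remove):
        # Leave entities such as &amp; untouched so the output stays valid HTML.
        super().__init__(convert_charrefs=False)
        self.tag_to_get = tag_to_get.lower()
        self.tags_to_remove = {tag.lower() for tag in tags_to_remove}
        self.keep_depth = 0  # above 0 while inside the tag to get
        self.skip_depth = 0  # above 0 while inside a tag to remove
        self.parts = []

    def _keeping(self):
        return self.keep_depth > 0 and self.skip_depth == 0

    def handle_starttag(self, tag, attrs):
        if tag in self.tags_to_remove:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag == self.tag_to_get:
            self.keep_depth += 1
        if self.keep_depth:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a nesting level.
        if self._keeping() and tag not in self.tags_to_remove:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.tags_to_remove:
            if self.skip_depth and tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        if self.skip_depth:
            return
        if self.keep_depth:
            self.parts.append(f"</{tag}>")
            if tag == self.tag_to_get:
                self.keep_depth -= 1

    def handle_data(self, data):
        if self._keeping():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._keeping():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._keeping():
            self.parts.append(f"&#{name};")


def select_html(html, tag_to_get="body", tags_to_remove=("script", "style")):
    """Return the markup of tag_to_get with tags_to_remove stripped out."""
    parser = TagExtractor(tag_to_get, tags_to_remove)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling `select_html(page_html)` with the defaults keeps the body and drops every script and style element inside it, together with their contents.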
Possible example use cases:
- Any type of job or workflow where you need to scrape a webpage for content. There are thousands of workflows that start with gathering context this way.
- It also works as the second step after a Google Search using AI Interpolator SERP, for any research workflow where you need to do something like get the first 10 results for a search term, scrape them, and use them as context to answer a question using OpenAI.
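The research workflow in the last use case can be sketched roughly as follows. This is only an illustration of the "scrape the first 10 results and use them as context" step: `gather_context` is a hypothetical helper, and the `scrape` argument stands in for whatever actually fetches a page (in Drupal, this rule does that job). Skipping pages that fail to scrape is an assumption of the sketch.

```python
def gather_context(result_urls, scrape, max_results=10):
    """Scrape the first max_results URLs and join their text into one context.

    result_urls is an ordered list of search result links and scrape is any
    callable that takes a URL and returns the extracted text.
    """
    sections = []
    for url in result_urls[:max_results]:
        try:
            text = scrape(url)
        except Exception:
            # A page that cannot be scraped is skipped so that one bad
            # result does not stop the whole workflow.
            continue
        if text and text.strip():
            # Label each section with its source so an answer can cite it.
            sections.append(f"Source: {url}\n{text.strip()}")
    return "\n\n".join(sections)
```

The returned string can then be placed in front of the question that is sent to the language model.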