Web Scraper Drivers

Overview

Web Scraper Drivers scrape text from the web. WebLoader uses them to fetch page content. All Web Scraper Drivers implement the following method:

scrape_url() scrapes text from a website and returns a TextArtifact. The format of the scraped text is determined by the Driver.
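As a rough illustration of this contract, the sketch below models a driver with a single scrape_url() method that returns a text artifact. The names SketchTextArtifact, SketchWebScraperDriver, and StaticPageDriver are hypothetical stand-ins invented for this example, not Griptape classes; the real base class and TextArtifact may differ in detail.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict


@dataclass
class SketchTextArtifact:
    """Hypothetical stand-in for a text artifact: it only wraps a string."""

    value: str


class SketchWebScraperDriver(ABC):
    """Hypothetical base class: every driver exposes scrape_url()."""

    @abstractmethod
    def scrape_url(self, url: str) -> SketchTextArtifact:
        ...


class StaticPageDriver(SketchWebScraperDriver):
    """Toy driver that serves canned text instead of touching the network."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages

    def scrape_url(self, url: str) -> SketchTextArtifact:
        if url not in self.pages:
            raise ValueError(f"No content available for {url}")
        # The driver decides the output format; this one returns plain text.
        return SketchTextArtifact(self.pages[url])


driver = StaticPageDriver({"https://example.com": "Hello from example.com"})
artifact = driver.scrape_url("https://example.com")
print(artifact.value)
```

Because callers depend only on scrape_url(), one driver can be swapped for another without changing the calling code; only the format of the returned text changes.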

Markdownify

Info

This driver requires the drivers-web-scraper-markdownify extra and the playwright browsers to be installed.

To install the playwright browsers, run playwright install in your terminal. If you are using poetry, run poetry run playwright install instead. The playwright command should already be installed as a dependency of the drivers-web-scraper-markdownify extra. For more details about playwright, see the playwright docs.

Note that if you skip installing the playwright browsers, you will see the following error when you run your code:

playwright._impl._errors.Error: Executable doesn't exist at ...
╔════════════════════════════════════════════════════════════╗
║ Looks like Playwright was just installed or updated.       ║
║ Please run the following command to download new browsers: ║
║                                                            ║
║     playwright install                                     ║
║                                                            ║
║ <3 Playwright Team                                         ║
╚════════════════════════════════════════════════════════════╝

The MarkdownifyWebScraperDriver outputs the scraped text in markdown format. It uses playwright to render web pages, including dynamically loaded content, and then uses beautifulsoup4 and markdownify to convert the rendered page to markdown. The conversion is best-effort: the aim is output that is concise yet readable by both humans and LLMs.
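To make the clean-then-convert idea concrete, here is a minimal sketch that drops script and style content and turns headings, paragraphs, and links into markdown. It swaps in Python's standard-library html.parser in place of playwright, beautifulsoup4, and markdownify, so it is an illustration of the general approach rather than the driver's actual implementation. TinyMarkdownConverter and html_to_markdown are hypothetical names, and the real driver's cleanup rules and output will differ.

```python
import re
from html.parser import HTMLParser


class TinyMarkdownConverter(HTMLParser):
    """Hypothetical, heavily simplified HTML-to-markdown converter."""

    SKIPPED = {"script", "style"}  # elements with no human-readable text
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self) -> None:
        super().__init__()
        self.parts = []      # output fragments, joined at the end
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.href = None     # target of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse whitespace runs so the output stays concise.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    converter = TinyMarkdownConverter()
    converter.feed(html)
    converter.close()
    # Blocks are separated by blank lines; drop the ones that end up empty.
    blocks = [block.strip() for block in "".join(converter.parts).split("\n\n")]
    return "\n\n".join(block for block in blocks if block)


html = """
<html><head><style>body { color: red; }</style></head>
<body>
<h1>Griptape</h1>
<p>Read the <a href="https://example.com/docs">docs</a> today.</p>
<script>console.log("ignored");</script>
</body></html>
"""
markdown = html_to_markdown(html)
print(markdown)
```

The sketch works on a static HTML string. A real page would first need to be rendered in a browser so that dynamically loaded content is present before conversion.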

Example of using MarkdownifyWebScraperDriver directly:

from griptape.drivers import MarkdownifyWebScraperDriver

driver = MarkdownifyWebScraperDriver()

driver.scrape_url("https://griptape.ai")

Example of using MarkdownifyWebScraperDriver with an agent:

from griptape.drivers import MarkdownifyWebScraperDriver
from griptape.loaders import WebLoader
from griptape.tools import TaskMemoryClient, WebScraper
from griptape.structures import Agent

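agent = Agent(
    tools=[
        WebScraper(
            web_loader=WebLoader(
                # timeout is given in milliseconds
                web_scraper_driver=MarkdownifyWebScraperDriver(timeout=1000)
            ),
            # Keep the scraped page out of the prompt; store it in Task Memory instead.
            off_prompt=True,
        ),
        # Lets the agent query Task Memory and return the result to the prompt.
        TaskMemoryClient(off_prompt=False),
    ],
)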
agent.run("List all email addresses on griptape.ai in a flat numbered markdown list.")

Trafilatura

Info

This driver requires the drivers-web-scraper-trafilatura extra.

The TrafilaturaWebScraperDriver scrapes text from a webpage using the Trafilatura library.

Example of using TrafilaturaWebScraperDriver directly:

from griptape.drivers import TrafilaturaWebScraperDriver

driver = TrafilaturaWebScraperDriver()

driver.scrape_url("https://griptape.ai")