ScrapeNetwork

Mastering Playwright: Comprehensive Guide on How to Block Resources

Playwright's request interception feature can make web scraping significantly more efficient. Blocking media and other non-essential requests saves the bandwidth they would otherwise consume and speeds up page loads. A web scraping API can be combined with Playwright to further streamline data collection. The following example aborts requests by resource type and by URL keyword, so the browser only loads the content the scraper needs:

from playwright.sync_api import sync_playwright

# block requests by resource type, e.g. image, stylesheet
BLOCK_RESOURCE_TYPES = [
  'beacon',
  'csp_report',
  'font',
  'image',
  'imageset',
  'media',
  'object',
  'texttrack',
  # we can even block stylesheets, scripts and XHR requests, though it's not recommended:
  # 'stylesheet',
  # 'script',
  # 'xhr',
]


# we can also block popular 3rd party resources like tracking and advertisements.
BLOCK_RESOURCE_NAMES = [
  'adzerk',
  'analytics',
  'cdn.api.twitter',
  'doubleclick',
  'exelator',
  'facebook',
  'fontawesome',
  'google',
  'google-analytics',
  'googletagmanager',
]

def intercept_route(route):
    """intercept all requests and abort blocked ones"""
    if route.request.resource_type in BLOCK_RESOURCE_TYPES:
        print(f'blocking background resource {route.request} blocked type "{route.request.resource_type}"')
        return route.abort()
    if any(key in route.request.url for key in BLOCK_RESOURCE_NAMES):
        print(f"blocking background resource {route.request} blocked name {route.request.url}")
        return route.abort()
    return route.continue_()

with sync_playwright() as pw:
    browser = pw.chromium.launch(
        headless=False,
        # tip: enable devtools to see the total resource usage (bottom left corner)
        devtools=True,
    )
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    # enable intercepting for this page, **/* stands for all requests
    page.route("**/*", intercept_route)
    page.goto("http://some-webpage.com/")

Resource blocking can significantly reduce bandwidth usage, often by a factor of 2 to 10. However, blocking functional resources such as stylesheets, scripts, and XHR requests can break page rendering or stop dynamic content from loading, which may affect the scraping process.
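
To check what blocking actually saves on a given site, and to spot functional requests that get blocked by accident, it can help to count every decision the handler makes. The following is a minimal sketch, not part of the original example: the helper names `block_reason`, `make_handler` and `fake_route` are hypothetical, and the block lists are shortened versions of the ones above. The decision logic is kept free of Playwright imports so it can be dry-run without launching a browser.

```python
from collections import Counter
from types import SimpleNamespace

# Shortened, illustrative block lists (assumption: adjust to your target site)
BLOCKED_TYPES = {"font", "image", "media", "texttrack"}
BLOCKED_NAMES = ("doubleclick", "google-analytics", "googletagmanager")


def block_reason(resource_type, url):
    """Return why a request would be blocked, or None to let it through."""
    if resource_type in BLOCKED_TYPES:
        return f"type:{resource_type}"
    for name in BLOCKED_NAMES:
        if name in url:
            return f"name:{name}"
    return None


def make_handler(stats):
    """Build a route handler that records every decision in `stats`."""
    def handler(route):
        reason = block_reason(route.request.resource_type, route.request.url)
        if reason:
            stats[reason] += 1
            return route.abort()
        stats["allowed"] += 1
        return route.continue_()
    return handler


def fake_route(resource_type, url):
    """Hypothetical stand-in for a Playwright route, used only for the dry run."""
    calls = []
    request = SimpleNamespace(resource_type=resource_type, url=url)
    return SimpleNamespace(
        request=request,
        abort=lambda: calls.append("abort"),
        continue_=lambda: calls.append("continue"),
        calls=calls,
    )


# Dry run: feed a few sample requests through the handler
stats = Counter()
handler = make_handler(stats)
for rtype, url in [
    ("image", "https://example.com/logo.png"),
    ("script", "https://www.googletagmanager.com/gtm.js"),
    ("xhr", "https://example.com/api/items"),
]:
    handler(fake_route(rtype, url))

print(dict(stats))
# {'type:image': 1, 'name:googletagmanager': 1, 'allowed': 1}
```

In a real scraper the same handler would be registered with `page.route("**/*", make_handler(stats))` before calling `page.goto(...)`, and `stats` printed once the page has loaded. If the page stops rendering the data you need, the counts show which block rule to relax first.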
