Comprehensive Guide: How to Block Resources in Selenium with Mitmproxy

Enhancing the efficiency of Selenium web scrapers involves strategies such as blocking media and superfluous background requests, which can significantly accelerate scraping operations by minimizing bandwidth usage and rendering time. However, Selenium cannot natively intercept and block requests, necessitating the use of an external proxy server for this purpose. One effective solution is leveraging a robust web scraping API, which simplifies the process of managing these requests without the need for extensive setup or maintenance, thus streamlining your scraping projects while ensuring high efficiency and accuracy.

A widely utilized proxy for such tasks is mitmproxy, a versatile tool that can be configured to block specific requests by resource type or name. The first step is installing mitmproxy, either via pip install mitmproxy or through your system's package manager. Next, a block.py script supplies mitmproxy with the desired blocking logic:

# block.py
from mitmproxy import http

# List of common third-party resources to block, including ads and trackers
BLOCK_RESOURCE_NAMES = [
    'adzerk',
    'analytics',
    'cdn.api.twitter',
    'doubleclick',
    'exelator',
    'facebook',
    'fontawesome',
    'google',
    'google-analytics',
    'googletagmanager',
    'images',  # generic keyword: blocks any URL containing "images"
]

# Blocking based on resource file extensions
BLOCK_RESOURCE_EXTENSIONS = [
    '.gif',
    '.jpg',
    '.jpeg',
    '.png',
    '.webp',
]

# Custom logic to handle requests
def request(flow: http.HTTPFlow) -> None:
    url = flow.request.pretty_url
    if (any(url.endswith(ext) for ext in BLOCK_RESOURCE_EXTENSIONS)
            or any(block in url for block in BLOCK_RESOURCE_NAMES)):
        print(f"Blocked {url}")
        flow.response = http.Response.make(
            404,  # HTTP status code
            b"Blocked",  # Response body
            {"Content-Type": "text/html"}  # Response headers
        )
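
Before wiring the proxy into Selenium, it can be sanity-checked with a plain HTTP client. The snippet below is a minimal sketch, assuming the proxy is already running on localhost:8080; the image URL is hypothetical, and verify=False merely skips certificate validation for this one-off check:

import requests

PROXIES = {
    "http": "http://localhost:8080",
    "https": "http://localhost:8080",
}

# A request for an image (hypothetical URL) should come back
# with the 404 response crafted in block.py
response = requests.get(
    "https://web-scraping.dev/assets/example.png",  # hypothetical image URL
    proxies=PROXIES,
    verify=False,  # mitmproxy re-signs HTTPS traffic, so skip validation here
)
print(response.status_code, response.text)  # expected: 404 Blocked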

Executing the proxy with mitmproxy -s block.py (or mitmdump -s block.py for a non-interactive console) starts it on localhost:8080, where it filters requests as configured. This setup can then be integrated with Selenium so that all non-essential requests are blocked:

from selenium import webdriver

PROXY = "localhost:8080"  # Proxy address

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f'--proxy-server={PROXY}')

chrome = webdriver.Chrome(options=chrome_options)
chrome.get("https://web-scraping.dev/product/1")
chrome.quit()

This approach not only reduces bandwidth consumption substantially, often by a factor of 2 to 10, but also speeds up scraping by skipping the download and rendering of unnecessary resources.
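
The savings can be checked empirically. The sketch below assumes Selenium 4 and a trusted mitmproxy certificate (see the tip below): it enables Chrome's performance log, loads the page, and sums the bytes reported on the wire; running it again with the proxy line commented out gives the baseline for comparison.

import json
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=localhost:8080')  # comment out for the baseline run
chrome_options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

chrome = webdriver.Chrome(options=chrome_options)
chrome.get("https://web-scraping.dev/product/1")

# Sum the encoded (on-the-wire) byte counts from Chrome's network events
total_bytes = 0
for entry in chrome.get_log('performance'):
    message = json.loads(entry['message'])['message']
    if message['method'] == 'Network.loadingFinished':
        total_bytes += message['params'].get('encodedDataLength', 0)

print(f"Transferred {total_bytes / 1024:.1f} KiB")
chrome.quit()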

Tip: To fully leverage mitmproxy with Selenium for HTTPS sites, installing the mitmproxy certificate on the browser is necessary. Detailed instructions can be found in our tutorial on installing the mitmproxy certificate.
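
For quick local experiments there is also a shortcut that avoids installing the certificate: starting Chrome with its --ignore-certificate-errors flag. This disables TLS validation entirely, so it is only suitable for testing, never for anything handling sensitive data:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=localhost:8080')
chrome_options.add_argument('--ignore-certificate-errors')  # testing only: trusts mitmproxy's re-signed certs

chrome = webdriver.Chrome(options=chrome_options)
chrome.get("https://web-scraping.dev/product/1")
chrome.quit()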
