Mastering Web Crawling: How to Ignore Non-HTML URLs Effectively

In the realm of data extraction and web analysis, efficiency and precision are paramount. One way to enhance the effectiveness of your web crawling efforts is to filter out and ignore non-HTML URLs, optionally with the help of a sophisticated web scraping API. This practice keeps your crawlers from getting bogged down by irrelevant or resource-intensive content such as archives, images, and videos. By focusing solely on HTML URLs, your scraper operates faster and more reliably, and the data it collects stays relevant to your needs. To achieve this, we can use two types of validation rules:

First, we can check the URL's extension against a list of common non-HTML file formats:

import posixpath
from urllib.parse import urlparse

IGNORED_EXTENSIONS = [
    # archives
    '7z', '7zip', 'bz2', 'rar', 'tar', 'tar.gz', 'xz', 'zip',
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif', 'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv', 'm4a', 'm4v', 'flv', 'webm',
    # office suites
    'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg', 'odp',
    # other
    'css', 'pdf', 'exe', 'bin', 'rss', 'dmg', 'iso', 'apk',
]

url = "https://example.com/foo.pdf"
# splitext returns the extension with a leading dot (e.g. ".pdf"), so strip it
extension = posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()
if extension not in IGNORED_EXTENSIONS:
    print("+ url is valid")
else:
    print("- url is invalid")

However, not all URLs include a file extension. As an alternative, we can inspect the Content-Type header of candidate URLs using HEAD requests, which retrieve only the response headers rather than the document body:

import requests

VALID_TYPES = [
    "text/html",
    # we might also want to scrape plain text files:
    "text/plain",
    # or json files:
    "application/json",
    # or even javascript files
    "application/javascript",
]

url = "https://example.com/foo.pdf"
response = requests.head(url)
# Content-Type often carries parameters (e.g. "text/html; charset=utf-8"),
# so compare only the media type itself
content_type = response.headers.get("Content-Type", "").split(";")[0].strip().lower()
if content_type in VALID_TYPES:
    print("+ url is valid")
else:
    print("- url is invalid")
