
Comprehensive Guide: How to Scrape Images from Website Using Python & BeautifulSoup

To extract images from a website, Python can be paired with an HTML parsing library such as BeautifulSoup. This combination makes it straightforward to select <img> elements, read their src attributes, and download the referenced files to your local system. For more demanding projects, a web scraping API can streamline the process further by handling complex, dynamically rendered pages and returning clean HTML to parse. This guide walks through a step-by-step approach to scraping images from websites using Python and BeautifulSoup.
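
At its core, the parsing step is just a matter of loading the HTML into BeautifulSoup and reading the src attribute of each <img> tag. As a minimal sketch, assuming a small made-up HTML string used purely for illustration, it could look like this:

from bs4 import BeautifulSoup

# a made-up HTML snippet used purely for illustration
html = """
<html>
  <body>
    <img src="/static/logo.png">
    <img src="https://example.com/banner.jpg">
    <p>No image here</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
# collect the src attribute of every <img> element that has one
image_urls = [img.get("src") for img in soup.find_all("img") if img.get("src")]
print(image_urls)  # ['/static/logo.png', 'https://example.com/banner.jpg']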

Here’s an example using httpx and BeautifulSoup (install both with pip install httpx beautifulsoup4):

import asyncio
import httpx
from bs4 import BeautifulSoup
from pathlib import Path


async def download_image(url, filepath, client):
    response = await client.get(url)
    filepath.write_bytes(response.content)
    print(f"Downloaded {url} to {filepath}")


async def scrape_images(url):
    download_dir = Path('images')
    download_dir.mkdir(parents=True, exist_ok=True)

    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        download_tasks = []
        for img_tag in soup.find_all("img"):
            img_url = img_tag.get("src")  # image URL, possibly relative
            if img_url:
                img_url = response.url.join(img_url)  # resolve to an absolute URL
                img_filename = download_dir / Path(img_url.path).name  # name the file after the last path segment
                download_tasks.append(
                    download_image(img_url, img_filename, client)
                )
        await asyncio.gather(*download_tasks)  # run all downloads concurrently

# example - scrape all images from the scrapenetwork.com homepage:
url = "https://scrapenetwork.com/"
asyncio.run(scrape_images(url))

In the example above, httpx.AsyncClient first retrieves the target page's HTML. The src attribute of every <img> element is then extracted and resolved to an absolute URL with response.url.join(). Finally, all images are downloaded concurrently and saved to the ./images directory.
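
As a possible refinement, the downloads can be throttled so that only a few requests hit the server at once. The sketch below swaps in a semaphore-limited variant of download_image; the function name download_image_limited and the limit of 5 are illustrative choices, not part of the original example:

import asyncio

async def download_image_limited(url, filepath, client, semaphore):
    # same as download_image above, but only a handful of downloads run at once
    async with semaphore:
        response = await client.get(url)
        filepath.write_bytes(response.content)
        print(f"Downloaded {url} to {filepath}")

# inside scrape_images, create the semaphore and pass it to each task:
#     semaphore = asyncio.Semaphore(5)  # 5 concurrent downloads is an arbitrary example limit
#     download_tasks.append(
#         download_image_limited(img_url, img_filename, client, semaphore)
#     )

A similar guard in the loop, such as skipping any src value that starts with data:, keeps inline base64 images out of the download queue.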