To extract images from a website, Python can be paired with an HTML parsing library such as BeautifulSoup. The process has three steps: fetch the page HTML, select the <img> elements, and read each element's src attribute to get the image URL, which can then be downloaded to your local system. For more complex pages, a web scraping API can streamline the process. This guide walks through scraping images from a website step by step using Python and BeautifulSoup.
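As a minimal sketch of the tag-selection step on its own, the snippet below parses a small hard-coded HTML string instead of a live page. The HTML content, the base_url value, and the use of the standard library's urljoin to resolve relative paths are assumptions made for illustration only.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page content and base URL, used only for illustration.
html = """
<html><body>
  <img src="/static/logo.png" alt="logo">
  <img src="https://cdn.example.com/photo.jpg">
  <img alt="no source attribute">
</body></html>
"""
base_url = "https://example.com/blog/"

soup = BeautifulSoup(html, "html.parser")
image_urls = [
    urljoin(base_url, img["src"])  # resolve relative paths against the page URL
    for img in soup.find_all("img")
    if img.get("src")  # skip <img> tags that have no src attribute
]
print(image_urls)
```

Only the two tags that carry a src attribute produce a URL, and the relative path is resolved against the assumed page address.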
Here’s an example using httpx and beautifulsoup (install using pip install httpx beautifulsoup4):
import asyncio
import httpx
from bs4 import BeautifulSoup
from pathlib import Path


async def download_image(url, filepath, client):
    response = await client.get(url)
    filepath.write_bytes(response.content)
    print(f"Downloaded {url} to {filepath}")


async def scrape_images(url):
    download_dir = Path('images')
    download_dir.mkdir(parents=True, exist_ok=True)

    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        download_tasks = []
        for img_tag in soup.find_all("img"):
            img_url = img_tag.get("src")  # get image url
            if img_url:
                img_url = response.url.join(img_url)  # turn url absolute
                img_filename = download_dir / Path(str(img_url)).name
                download_tasks.append(
                    download_image(img_url, img_filename, client)
                )
        await asyncio.gather(*download_tasks)


# example - scrape all scrape network blog images:
url = "https://scrapenetwork.com/"
asyncio.run(scrape_images(url))
In the above example, httpx.AsyncClient first retrieves the HTML of the target page. Next, the src attribute of every <img> element is extracted and resolved to an absolute URL. Finally, all images are downloaded concurrently and saved to the ./images directory.
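The example above names each file after the last segment of the image URL, so a query string such as ?w=300 becomes part of the filename, and two images that share a name overwrite each other. One possible way to handle this is sketched below. The helper filename_from_url, its default parameter, and the sample URLs are hypothetical and not part of the original script.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def filename_from_url(img_url: str, used: set[str], default: str = "image") -> str:
    """Derive a unique local filename from an image URL (hypothetical helper)."""
    # Use only the path component so "?w=300" or "#frag" never ends up in the name.
    name = Path(unquote(urlsplit(img_url).path)).name or default
    stem, suffix = Path(name).stem, Path(name).suffix
    candidate, counter = name, 1
    # Append -1, -2, ... until the name is unused, so files do not overwrite each other.
    while candidate in used:
        candidate = f"{stem}-{counter}{suffix}"
        counter += 1
    used.add(candidate)
    return candidate


used_names: set[str] = set()
names = [
    filename_from_url(u, used_names)
    for u in [
        "https://example.com/a/logo.png?w=300",
        "https://example.com/b/logo.png",
        "https://example.com/",
    ]
]
print(names)  # ['logo.png', 'logo-1.png', 'image']
```

In the scraping loop, the result could replace Path(str(img_url)).name when building img_filename, with a single used set shared across all images on the page.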