
Comprehensive Guide: How to Find All Links Using BeautifulSoup Effectively

BeautifulSoup, a cornerstone of the Python web scraping toolkit, offers a straightforward approach to parsing HTML and extracting valuable data. One of its core functionalities is the ability to efficiently locate all links on a webpage, using either the find_all() method or CSS selectors via the select() method. This feature is indispensable for a wide range of applications, from data mining to automated testing, where gathering hyperlink information is a critical step. To complement BeautifulSoup and elevate your scraping projects, incorporating a web crawling API into your workflow can significantly expand your capabilities. These APIs are designed to handle common scraping obstacles such as dynamic content loading and anti-scraping mechanisms, making data collection more effective and less labor-intensive.

import bs4

# parse a small HTML snippet; passing an explicit parser avoids bs4's parser warning
soup = bs4.BeautifulSoup("""
<a href="/pricing">Pricing</a>
<a href="https://example.com/blog">Blog</a>
<a href="https://twitter.com/@company">Twitter</a>
""", "html.parser")

links = [node.get('href') for node in soup.find_all("a")]
print(links)
# will print
# ['/pricing', 'https://example.com/blog', 'https://twitter.com/@company']

# or with CSS selectors:
links = [node.get('href') for node in soup.select('a')]
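
Note that find_all("a") also matches anchor tags that have no href attribute at all, in which case .get('href') returns None. If that is a concern, you can ask BeautifulSoup to match only anchors that carry the attribute, for example:

# keep only <a> tags that actually have an href attribute
links = [node['href'] for node in soup.find_all("a", href=True)]
# or with CSS selectors:
links = [node['href'] for node in soup.select('a[href]')]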

It’s important to note that bs4 extracts links exactly as they appear in the page’s HTML. Links can be:

  • Relative to the current website like /pricing
  • Absolute like https://example.com/blog
  • Absolute outbound like https://twitter.com/@company
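
If you need to tell these cases apart programmatically, a minimal sketch using only the standard library's urlparse (with the example links above, and example.com standing in for the current site) could look like this:

from urllib.parse import urlparse

for link in ["/pricing", "https://example.com/blog", "https://twitter.com/@company"]:
    parsed = urlparse(link)
    if not parsed.netloc:
        print(link, "-> relative")          # no host part in the URL
    elif parsed.netloc == "example.com":
        print(link, "-> absolute, same site")
    else:
        print(link, "-> absolute, outbound")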

You can convert all relative URLs to absolute using the urllib.parse.urljoin function:

from urllib.parse import urljoin

base_url = "https://example.com"
# relative URLs are resolved against base_url; absolute URLs are left untouched
links = [urljoin(base_url, link) for link in links]
print(links)
# will print
# ['https://example.com/pricing', 'https://example.com/blog', 'https://twitter.com/@company']

If you want to limit your scraper to a specific website, you can filter out outbound URLs. The tldextract library can be used to extract each link's registered domain (e.g. example.com) and compare it against the domain you allow:

import tldextract

allowed_domain = "example.com"
for link in links:
    # registered_domain returns e.g. "example.com" for "https://example.com/pricing"
    domain = tldextract.extract(link).registered_domain
    if domain == allowed_domain:
        print(link)
# will print
# https://example.com/pricing
# https://example.com/blog
# notice the twitter url is missing
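
Putting it all together, a minimal end-to-end sketch might look like the following. It assumes the requests library is installed and uses https://example.com purely as a placeholder target URL:

import bs4
import requests
import tldextract
from urllib.parse import urljoin

base_url = "https://example.com"   # placeholder target URL
allowed_domain = "example.com"

# fetch the page and parse it
response = requests.get(base_url)
soup = bs4.BeautifulSoup(response.text, "html.parser")

# collect hrefs, resolve relative URLs against the base URL
links = [urljoin(base_url, node['href']) for node in soup.find_all("a", href=True)]

# keep only links that belong to the allowed domain
internal_links = [
    link for link in links
    if tldextract.extract(link).registered_domain == allowed_domain
]
print(internal_links)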