BeautifulSoup, a cornerstone of the Python web scraping toolkit, offers a straightforward way to parse HTML and extract valuable data. One of its core features is locating all links on a webpage, using either the find_all() method or CSS selectors via the select() method. This is indispensable for a wide range of applications, from data mining to automated testing, where gathering hyperlink information is a critical step. To complement BeautifulSoup and elevate your scraping projects, incorporating a web crawling API into your workflow can significantly expand your capabilities. These APIs are built to handle common scraping obstacles such as dynamic content loading and anti-scraping mechanisms, making data collection more effective and less labor-intensive.
import bs4

html = """
<a href="/pricing">Pricing</a>
<a href="https://example.com/blog">Blog</a>
<a href="https://twitter.com/@company">Twitter</a>
"""
soup = bs4.BeautifulSoup(html, "html.parser")  # explicit parser avoids a bs4 warning

# collect the href attribute of every <a> tag
links = [node.get("href") for node in soup.find_all("a")]
print(links)
# ['/pricing', 'https://example.com/blog', 'https://twitter.com/@company']

# or with CSS selectors:
links = [node.get("href") for node in soup.select("a")]
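In practice you will usually parse a live page rather than a hardcoded snippet. A minimal sketch of that workflow, assuming the requests library is installed and using https://example.com as a placeholder URL:

import bs4
import requests

# fetch the page; https://example.com is a placeholder
response = requests.get("https://example.com")
response.raise_for_status()  # raise on HTTP errors instead of parsing an error page

soup = bs4.BeautifulSoup(response.text, "html.parser")
# href=True skips <a> tags that carry no href attribute
links = [node["href"] for node in soup.find_all("a", href=True)]
print(links)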
It’s important to note that bs4 extracts links exactly as they appear in the HTML source. Links can be:
- Relative to the current website, like /pricing
- Absolute, like https://example.com/blog
- Absolute and outbound, like https://twitter.com/@company
You can convert all relative URLs to absolute ones using the urllib.parse.urljoin function:
from urllib.parse import urljoin
base_url = "https://example.com"
links = [urljoin(base_url, link) for link in links]
print(links)
# will print:
# ['https://example.com/pricing', 'https://example.com/blog', 'https://twitter.com/@company']
If you want to limit your scraper to a specific website, you can filter out outbound URLs. The tldextract library can identify the registered domain of each link (the domain name plus its public suffix, e.g. example.com):
import tldextract

allowed_domain = "example.com"
for link in links:
    # registered_domain joins the domain and its public suffix, e.g. "example.com"
    domain = tldextract.extract(link).registered_domain
    if domain == allowed_domain:
        print(link)
# will print:
# https://example.com/pricing
# https://example.com/blog
# notice the Twitter URL is filtered out
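A useful side effect of matching on the registered domain rather than the full hostname is that subdomains of the allowed site still pass the filter. A small check, using blog.example.com as a made-up subdomain:

import tldextract

# subdomains share the registered domain of the parent site
print(tldextract.extract("https://blog.example.com/post").registered_domain)  # example.com
print(tldextract.extract("https://twitter.com/@company").registered_domain)   # twitter.com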