Web scraping with Puppeteer often involves pages that load additional content only as you scroll toward the bottom, a pattern known as infinite scrolling. A web scraping API with built-in support for dynamic content and infinite scrolling can make this kind of data collection more efficient and accurate.
To scroll in our Puppeteer browser, we can call the browser's built-in JavaScript function window.scrollTo(x, y) inside the page, which scrolls it to the given coordinates. Combined with a scraping API, this method helps you navigate and extract data from complex, dynamically loaded websites.
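As a minimal sketch of a single scroll step, the helper below runs window.scrollTo() inside the page through page.evaluate() and returns the position the page actually ended up at. The helper name scrollToPosition is hypothetical, and `page` is assumed to be a Puppeteer Page object that has already navigated somewhere.

```javascript
// Hypothetical helper (not part of Puppeteer): scroll to absolute
// coordinates and report where the page actually ended up.
async function scrollToPosition(page, x, y) {
  // The callback runs inside the browser, not in Node, so x and y
  // have to be passed in as arguments after the function.
  return page.evaluate((left, top) => {
    window.scrollTo(left, top);
    return { x: window.scrollX, y: window.scrollY };
  }, x, y);
}
```

Calling `await scrollToPosition(page, 0, 500)` would request a scroll 500 pixels down. The returned position can be smaller than requested when the page is shorter than that, which is one way to notice that the bottom has been reached.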
If we need to scroll to the absolute bottom of the page, a while loop can be employed to keep scrolling until the bottom is reached. Let’s examine an example by scraping web-scraping.dev/testimonials:
const puppeteer = require('puppeteer');

async function scrapeTestimonials() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://web-scraping.dev/testimonials/');

  let prevHeight = -1;
  const maxScrolls = 100; // safety cap so the loop always terminates
  let scrollCount = 0;
  while (scrollCount < maxScrolls) {
    // Scroll to the bottom of the page
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    // Give newly requested content time to load
    // (page.waitForTimeout() is no longer available in current Puppeteer versions)
    await new Promise((resolve) => setTimeout(resolve, 1000));
    // Measure the new scroll height and stop once it no longer grows
    const newHeight = await page.evaluate('document.body.scrollHeight');
    if (newHeight === prevHeight) {
      break;
    }
    prevHeight = newHeight;
    scrollCount += 1;
  }

  // Collect all loaded data
  const elements = await page.$$('.testimonial');
  const results = [];
  for (const element of elements) {
    // Read the inner HTML of each testimonial's .text node
    const text = await element.$eval('.text', (node) => node.innerHTML);
    results.push(text);
  }
  console.log(`Scraped: ${results.length} results!`);
  await browser.close();
}

scrapeTestimonials().catch(console.error);
In the above example, we’re scraping an endless-paging example from the web-scraping.dev website. We start a while loop and keep scrolling to the bottom until the page’s scroll height stops changing or the maxScrolls limit is reached. Once the bottom is reached, we can start parsing the content.
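A fixed one-second pause can be too long on fast pages and too short on slow ones. As a hedged alternative, the sketch below waits with page.waitForFunction() until the document actually grows, and treats a timeout as the sign that the bottom has been reached. The helper name scrollToBottom and its option names (maxScrolls, timeoutMs) are hypothetical, and the check on err.name assumes that Puppeteer's timeout errors are named TimeoutError.

```javascript
// Hypothetical helper: scroll until the page height stops growing.
// `page` is a Puppeteer Page; option names and defaults are illustrative.
async function scrollToBottom(page, { maxScrolls = 100, timeoutMs = 3000 } = {}) {
  let scrolls = 0;
  while (scrolls < maxScrolls) {
    // Record the current height, then jump to the bottom of the document.
    const prevHeight = await page.evaluate(() => {
      const height = document.body.scrollHeight;
      window.scrollTo(0, height);
      return height;
    });
    scrolls += 1;
    try {
      // Resolves as soon as the document is taller than the recorded height.
      await page.waitForFunction(
        (height) => document.body.scrollHeight > height,
        { timeout: timeoutMs },
        prevHeight
      );
    } catch (err) {
      // Assumption: a timeout means no more content is coming.
      // Anything else is a real failure and is rethrown.
      if (err.name !== 'TimeoutError') throw err;
      break;
    }
  }
  // Number of scroll attempts made, including the final one that loaded nothing.
  return scrolls;
}
```

In scrapeTestimonials(), the whole while loop could then be replaced by `await scrollToBottom(page);`. The trade-off is that the last iteration always waits the full timeoutMs before giving up, so the timeout should be chosen with the target site's loading speed in mind.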