Scraping HTML List Data From A Dynamic Server
Solution 1:
@SpencerBench is spot on in saying
It could be that the page is using some combination of scroll state, element visibility, or element positions to trigger content loading. If that's the case, then you'll need to figure out what it is and trigger it programmatically.
To answer the question for your specific use case, we need to understand the behaviour of the page you want to scrape data from, or as I asked in the comments, how do you know the page is "finished"?
However, it's possible to give a fairly generic answer to the question which should act as a starting point for you.
This answer uses Selenium, a package which is commonly used for automating testing of web UIs, but as they say on their home page, that's not the only thing it can be used for.
Primarily it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should) also be automated as well.
The web site I'm scraping
So first we need a web site. I've created one using ASP.NET Core MVC with .NET Core 3.1, although the web site's technology stack isn't important; it's the behaviour of the page you want to scrape that matters. This site has two pages, unimaginatively called Page1 and Page2.
Page controllers
There's nothing special in these controllers:
namespace StackOverflow68925623Website.Controllers
{
    using Microsoft.AspNetCore.Mvc;

    public class Page1Controller : Controller
    {
        public IActionResult Index()
        {
            return View("Page1");
        }
    }
}

namespace StackOverflow68925623Website.Controllers
{
    using Microsoft.AspNetCore.Mvc;

    public class Page2Controller : Controller
    {
        public IActionResult Index()
        {
            return View("Page2");
        }
    }
}
API controller
There's also an API controller (i.e. it returns data rather than a view) which the views can call asynchronously to get some data to display. This one just creates an array of the requested number of random strings.
namespace StackOverflow68925623Website.Controllers
{
    using Microsoft.AspNetCore.Mvc;
    using System;
    using System.Collections.Generic;
    using System.Text;

    [Route("api/[controller]")]
    [ApiController]
    public class DataController : ControllerBase
    {
        [HttpGet("Create")]
        public IActionResult Create(int numberOfElements)
        {
            var response = new List<string>();
            for (var i = 0; i < numberOfElements; i++)
            {
                response.Add(RandomString(10));
            }
            return Ok(response);
        }

        private string RandomString(int length)
        {
            var sb = new StringBuilder();
            var random = new Random();
            for (var i = 0; i < length; i++)
            {
                // Random.Next's upper bound is exclusive, so use 91 to include 'Z'
                var characterCode = random.Next(65, 91); // A-Z
                sb.Append((char)characterCode);
            }
            return sb.ToString();
        }
    }
}
Views
Page1's view looks like this:
@{
ViewData["Title"] = "Page 1";
}
<div class="text-center">
<divid="list" /><scriptsrc="~/lib/jquery/dist/jquery.min.js"></script><script>var apiUrl = 'https://localhost:44394/api/Data/Create';
$(document).ready(function () {
$('#list').append('<li id="loading">Loading...</li>');
$.ajax({
url: apiUrl + '?numberOfElements=20000',
datatype: 'json',
success: function (data) {
$('#loading').remove();
var insert = ''for (var item of data) {
insert += '<li>' + item + '</li>';
}
insert = '<ul id="results">' + insert + '</ul>';
$('#list').html(insert);
},
error: function (xht, status) {
alert('Error: ' + status);
}
});
});
</script>
</div>
So when the page first loads, it just contains an empty div called list. However, loading the page triggers the function passed to jQuery's $(document).ready function, which makes an asynchronous call to the API controller, requesting an array of 20,000 elements. While the call is in progress, "Loading..." is displayed on the screen, and when the call returns, it is replaced by an unordered list containing the received data. This is written in a way intended to be friendly to developers of automated UI tests or of screen scrapers, because we can tell whether all the data has loaded by testing whether or not the page contains an element with the ID results.
Page2's view looks like this:
@{
    ViewData["Title"] = "Page 2";
}

<div class="text-center">
    <div id="list"><ul id="results"></ul></div>
    <script src="~/lib/jquery/dist/jquery.min.js"></script>
    <script>
        var apiUrl = 'https://localhost:44394/api/Data/Create';
        var requestCount = 0;
        var maxRequests = 20;
        $(document).ready(function () {
            getData();
        });
        function getDataIfAtBottomOfPage() {
            console.log("scroll - " + requestCount + " requests");
            if (requestCount < maxRequests) {
                console.log("scrollTop " + document.documentElement.scrollTop + " scrollHeight " + document.documentElement.scrollHeight);
                if (document.documentElement.scrollTop > (document.documentElement.scrollHeight - window.innerHeight - 100)) {
                    getData();
                }
            }
        }
        function getData() {
            window.onscroll = undefined;
            requestCount++;
            $('#results').append('<li id="loading">Loading...</li>');
            $.ajax({
                url: apiUrl + '?numberOfElements=50',
                dataType: 'json',
                success: function (data) {
                    var insert = '';
                    for (var item of data) {
                        insert += '<li>' + item + '</li>';
                    }
                    $('#loading').remove();
                    $('#results').append(insert);
                    if (requestCount < maxRequests) {
                        window.setTimeout(function () { window.onscroll = getDataIfAtBottomOfPage; }, 1000);
                    } else {
                        $('#results').append('<li>That\'s all folks</li>');
                    }
                },
                error: function (xhr, status) {
                    alert('Error: ' + status);
                }
            });
        }
    </script>
</div>
This gives a nicer user experience because it requests data from the API controller in multiple smaller chunks, so the first chunk of data appears fairly quickly. Once the user has scrolled down to somewhere near the bottom of the page, the next chunk is requested, until 20 chunks have been requested and displayed, at which point the text "That's all folks" is added to the end of the unordered list. However, this is more difficult to interact with programmatically, because you need to scroll the page down to make the new data appear.
(Yes, this implementation is a bit buggy - if the user gets to the bottom of the page too quickly then requesting the next chunk of data doesn't happen until they scroll up a bit. But the question isn't about how to implement this behaviour in a web page, but about how to scrape the displayed data, so please forgive my bugs.)
The scraper
I've implemented the scraper as an xUnit unit test project, just because I'm not doing anything with the data I've scraped from the web site other than Asserting that it is of the correct length, and therefore proving that I haven't prematurely assumed that the web page I'm scraping is "finished". You can put most of this code (other than the Asserts) into any type of project.
Having created your scraper project, you need to add the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver NuGet packages.
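If you use the .NET CLI rather than Visual Studio's package manager, the two packages named above can be added from the project directory like this (versions omitted; you'd normally pin the ChromeDriver package to match your installed Chrome):

```shell
# Add Selenium's WebDriver API and the Chrome driver binary package
dotnet add package Selenium.WebDriver
dotnet add package Selenium.WebDriver.ChromeDriver
```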
Page Object Model
I'm using the Page Object Model pattern to provide a layer of abstraction between functional interaction with the page and the implementation detail of how to code that interaction. Each of the pages in the web site has a corresponding page model class for interacting with that page.
First, a base class with some code which is common to more than one page model class.
namespace StackOverflow68925623Scraper
{
    using System;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Support.UI;

    public class PageModel
    {
        protected PageModel(IWebDriver driver)
        {
            this.Driver = driver;
        }

        protected IWebDriver Driver { get; }

        public void ScrollToTop()
        {
            var js = (IJavaScriptExecutor)this.Driver;
            js.ExecuteScript("window.scrollTo(0, 0)");
        }

        public void ScrollToBottom()
        {
            var js = (IJavaScriptExecutor)this.Driver;
            js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight)");
        }

        protected IWebElement GetById(string id)
        {
            try
            {
                return this.Driver.FindElement(By.Id(id));
            }
            catch (NoSuchElementException)
            {
                return null;
            }
        }

        protected IWebElement AwaitGetById(string id)
        {
            var wait = new WebDriverWait(this.Driver, TimeSpan.FromSeconds(10));
            return wait.Until(e => e.FindElement(By.Id(id)));
        }
    }
}
This base class gives us 4 convenience methods:
- Scroll to the top of the page
- Scroll to the bottom of the page
- Get the element with the supplied ID, or return null if it doesn't exist
- Get the element with the supplied ID, or wait for up to 10 seconds for it to appear if it doesn't exist yet
And each page in the web site has its own model class, derived from that base class.
namespace StackOverflow68925623Scraper
{
    using OpenQA.Selenium;

    public class Page1Model : PageModel
    {
        public Page1Model(IWebDriver driver) : base(driver)
        {
        }

        public IWebElement AwaitResults => this.AwaitGetById("results");

        public void Navigate()
        {
            this.Driver.Navigate().GoToUrl("https://localhost:44394/Page1");
        }
    }
}
namespace StackOverflow68925623Scraper
{
    using OpenQA.Selenium;

    public class Page2Model : PageModel
    {
        public Page2Model(IWebDriver driver) : base(driver)
        {
        }

        public IWebElement Results => this.GetById("results");

        public void Navigate()
        {
            this.Driver.Navigate().GoToUrl("https://localhost:44394/Page2");
        }
    }
}
And the Scraper class:
namespace StackOverflow68925623Scraper
{
    using OpenQA.Selenium.Chrome;
    using System;
    using System.Threading;
    using Xunit;

    public class Scraper
    {
        [Fact]
        public void TestPage1()
        {
            // Arrange
            var driver = new ChromeDriver();
            var page = new Page1Model(driver);
            page.Navigate();
            try
            {
                // Act
                var actualResults = page.AwaitResults.Text.Split(Environment.NewLine);

                // Assert
                Assert.Equal(20000, actualResults.Length);
            }
            finally
            {
                // Ensure the browser window closes even if things go pear-shaped
                driver.Quit();
            }
        }

        [Fact]
        public void TestPage2()
        {
            // Arrange
            var driver = new ChromeDriver();
            var page = new Page2Model(driver);
            page.Navigate();
            try
            {
                // Act
                while (!page.Results.Text.Contains("That's all folks"))
                {
                    Thread.Sleep(1000);
                    page.ScrollToBottom();
                    page.ScrollToTop();
                }
                var actualResults = page.Results.Text.Split(Environment.NewLine);

                // Assert - we expect 1001 because of the extra "That's all folks"
                Assert.Equal(1001, actualResults.Length);
            }
            finally
            {
                // Ensure the browser window closes even if things go pear-shaped
                driver.Quit();
            }
        }
    }
}
So, what's happening here?
// Arrange
var driver = new ChromeDriver();
var page = new Page1Model(driver);
page.Navigate();
ChromeDriver is in the Selenium.WebDriver.ChromeDriver package and implements the IWebDriver interface from the Selenium.WebDriver package with the code to interact with the Chrome browser. Other packages are available containing implementations for all popular browsers. Instantiating the driver object opens a browser window, and calling its Navigate method directs the browser to the page we want to test/scrape.
// Act
var actualResults = page.AwaitResults.Text.Split(Environment.NewLine);
Because on Page1 the results element doesn't exist until all the data has been displayed, and no user interaction is required in order for it to be displayed, we use the page model's AwaitResults property to wait for that element to appear and return it once it has. AwaitResults returns an IWebElement instance representing the element, which in turn has various methods and properties we can use to interact with the element. In this case we use its Text property, which returns the element's contents as a string, without any markup. Because the data is displayed as an unordered list, each element in the list is delimited by a line break, so we can use String's Split method to convert it to a string array.
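To illustrate with made-up item values (the real page shows random 10-character strings), splitting the Text of a three-item unordered list looks like this:

```csharp
// Hypothetical example of what IWebElement.Text returns for a 3-item <ul>
var text = "ABCDEFGHIJ" + Environment.NewLine
         + "KLMNOPQRST" + Environment.NewLine
         + "UVWXYZABCD";
var items = text.Split(Environment.NewLine);
// items is a 3-element string array: { "ABCDEFGHIJ", "KLMNOPQRST", "UVWXYZABCD" }
```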
Page2 needs a different approach - we can't use the presence of the results element to determine whether all the data has been displayed, because that element is on the page right from the start. Instead, we need to check for the string "That's all folks", which is written at the end of the last chunk of data. Also, the data isn't loaded all in one go, and we need to keep scrolling down in order to trigger the loading of the next chunk.
// Act
while (!page.Results.Text.Contains("That's all folks"))
{
    Thread.Sleep(1000);
    page.ScrollToBottom();
    page.ScrollToTop();
}
var actualResults = page.Results.Text.Split(Environment.NewLine);
Because of the bug in the UI that I mentioned earlier, if we get to the bottom of the page too quickly, the fetch of the next chunk of data isn't triggered, and attempting to scroll down when already at the bottom of the page doesn't raise another scroll event. That's why I'm scrolling to the bottom of the page and then back to the top - that way I can guarantee that a scroll event is raised. You never know, the web site you're trying to scrape data from may itself be buggy.
Once the "That's all folks" text has appeared, we can go ahead and get the results element's Text property and convert it to a string array as before.
// Assert - we expect 1001 because of the extra "That's all folks"
Assert.Equal(1001, actualResults.Length);
This is the bit that won't be in your code. Because I'm scraping a web site which is under my control, I know exactly how much data it should be displaying so I can check that I've got all the data, and therefore that my scraping code is working correctly.
Further reading
Absolute beginner's introduction to Selenium: https://www.guru99.com/selenium-csharp-tutorial.html
(A curiosity in that article is the way that it starts by creating a console application project and later changes its output type to class library and manually adds the unit test packages, when the project could have been created using one of Visual Studio's unit test project templates. It gets to the right place in the end, albeit via a rather odd route.)
Selenium documentation: https://www.selenium.dev/documentation/
Happy scraping!
Solution 2:
If you need to fully execute the web page, then a complete browser like CefSharp is your only option.
It could be that the page is using some combination of scroll state, element visibility, or element positions to trigger content loading. If that's the case, then you'll need to figure out what it is and trigger it programmatically. I know that CefSharp can simulate user actions like clicking, scrolling, etc.
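As a rough sketch of that idea (assuming the CefSharp.OffScreen NuGet package; ChromiumWebBrowser and EvaluateScriptAsync are real CefSharp types, but treat the surrounding details as assumptions to check against the CefSharp documentation), scrolling could be triggered by executing script in the page:

```csharp
// Hypothetical sketch: drive scrolling in an off-screen CefSharp browser.
// Assumes the CefSharp.OffScreen NuGet package has been added.
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

public static class CefScrollSketch
{
    public static async Task ScrollToBottomAsync(ChromiumWebBrowser browser)
    {
        // Running script in the page is one way to simulate the user scrolling,
        // which should fire the same scroll events the page's own code listens for
        await browser.EvaluateScriptAsync(
            "window.scrollTo(0, document.body.scrollHeight)");
    }
}
```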