Web Scraping with Python Tutorial – A Complete Guide with Examples

Himanshu Sheth

Posted On: November 7, 2023

We live in an era where we are surrounded by data that can be harnessed by extracting meaningful insights from it. As quoted by Tim Berners-Lee, inventor of the World Wide Web – “Data is a precious thing and will last longer than the systems themselves.”

If the data is the new oil, web scraping (or web harvesting) is the expeller that helps squeeze more oil.🙂Web scraping can be leveraged for analyzing and deriving actionable insights from tons of information available on the internet.

Irrespective of the business domain (e.g., eCommerce, EdTech, Fintech, etc.), scraping can be used for market research, pricing intelligence, lead generation, and sentiment analysis, to name a few! Though web scraping is immensely useful, it comes with a caveat – Scraping should be done legally and ethically, respecting the website’s T&C and data privacy regulations.

Popular programming languages like Python, Java, JavaScript, etc., are well-equipped with tools and libraries that can ease the job of web scraping. In this blog, I will limit the discussion of web scraping with Python – a popular programming language for automated testing. You can check my earlier blog on Python automation testing, highlighting why Python is the best-suited language for automation testing.

By the end of this blog on web scraping with Python, you will learn to scrap static and dynamic websites using the best Python tools (or libraries) like PyUnit, pytest, and Beautiful Soup. The actionable examples will help you harness the capabilities of web scraping with Python to extract meaningful insights from websites.

Without further ado, let’s get started…

Note: Scrap & Scrape and document & page are used interchangeably throughout the blog.

What is Web Scraping?

To set the ball rolling, let’s do a quick recap of the What & Why of web scraping. In simple terms, web scraping, or web harvesting (or data extraction), is the technique for deriving information (that matters) from websites.

Shown below is a simplified representation of web scraping: the input is a website, which is then scraped by the scraping logic.

Web Scraping Architecture

Once the information from the HTML of the page is scraped, the raw data can be stored in a more presentable (or readable) format in Excel sheets, databases, etc.
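
As a small illustration of the storage step, here is a minimal sketch that serializes scraped records to CSV using Python's standard csv module. The helper name rows_to_csv, the field names (name, price), and the sample rows are made up for illustration; in a real run, the rows would come from the scraper and the output would go to a file instead of an in-memory buffer.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts (one per scraped item) into CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical scraped records, used only to show the output shape.
sample_rows = [
    {"name": "Sample Phone", "price": "$100.00"},
    {"name": "Sample Tablet", "price": "$250.00"},
]
csv_text = rows_to_csv(sample_rows, fieldnames=["name", "price"])
print(csv_text)
```

A CSV file opens directly in Excel, which makes it a convenient first stop before moving the data to a database.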

With scraping options in the arsenal, it becomes imperative to opt for efficient and responsible scraping! In the examples presented later in this blog on web scraping with Python, the scraped information will be presented on the terminal.

Prominent Use Cases of Web Scraping

Most new-age websites leverage JavaScript for loading and updating content on the page. AJAX (Asynchronous JavaScript and XML) is used to retrieve data from the server and update relevant parts of the page (without a full page reload).

For instance, videos on the LambdaTest YouTube channel are dynamically loaded when the user performs a scroll operation. As seen in the below screenshots, the URL remains unchanged, but the videos are loaded on a dynamic basis.

Example – Dynamic Website Content

Popular Python frameworks like pytest and PyUnit, in conjunction with Selenium, can be used for scraping dynamic websites (and SPAs) like YouTube, Netflix, LambdaTest eCommerce Playground, and more.

It goes without saying that scraping can be done on static websites where the entire HTML page content is downloaded and parsed to extract the desired information. You can check out my earlier blog highlighting the difference between Static and Dynamic Web Scraping.

Now that we know that it is possible to scrap static and dynamic websites, let’s look at some of the primary use cases of web scraping:

Competitor Analysis

Irrespective of the size (or scale) of the business, businesses must keep a close watch on their competition. Insights into competitor’s products and services can be instrumental in having an upper edge over the competition.

Web scraping can effectively scrape relevant information (e.g., products, services, pricing, etc.) from competitors’ websites. The scraped data can be leveraged for tweaking product & pricing-specific information on the website.

Lead Generation

Web scraping can gather details (e.g., names, phone numbers, email addresses, etc.) from websites to generate leads for your business.

The scraped data is then fed to CRM software, so the sales team can reach out to those leads.

Data Analysis

Many websites have products that span across different categories. For instance, a horizontal eCommerce website might have products across electronics, apparel, baby care, etc.

A web scraper is instrumental in extracting product metadata, seller details, number of SKUs, etc., from such websites. Whether it is a SPA (Single Page Application) or paginated content, relevant web scraping tools & libraries let you scrape information from them.

Apart from the use cases mentioned above, you can also use web scraping for academic research and training data for ML projects!

Python Libraries and Tools for Web Scraping

I have also tried web scraping using C#, but its web scraping ecosystem is nothing like Python's. One of the major benefits of using Python is the availability of many tools and libraries for web scraping.

It is recommended to perform the package installation in a virtual environment (venv) to isolate it from the packages in the base environment. Run the commands virtualenv venv and source venv/bin/activate on the terminal to create the virtual environment.

Though I will be covering a few libraries (or tools) using a detailed demonstration, let’s list some of the popular ones below:

Selenium

For starters, Selenium is a popular open-source framework that helps with web browser automation. The tools and libraries offered by Selenium are instrumental in automating interactions with the elements in the document (or page). When writing this blog on web scraping with Python, the latest version was 4.14.0.

Selenium 3, the earlier version of Selenium, used the JSON-Wire protocol. On the other hand, Selenium 4 uses the W3C WebDriver protocol, which makes automated tests less flaky (or more stable) when compared to tests implemented using Selenium 3. For a quick recap, you can check out our blog highlighting the differences between Selenium 3 and Selenium 4.

Most modern-day web pages use AJAX & JavaScript to load and display dynamic content. Consider LambdaTest eCommerce playground. The content is loaded when the user scrolls the page. As seen in the screenshot below, the images are lazy loaded and appear in the DOM once the user scrolls the page.

Example of Lazy Loading of Images

This is where Selenium proves to be an instrumental option for web scraping due to its ability to handle dynamically loaded content using JavaScript.

In the case of Python, popular test automation frameworks like pytest and PyUnit can be leveraged for scraping static and dynamic page content. Run the command pip install selenium (or pip3 install selenium) depending on the Python version installed in the machine.

PyUnit (or unittest) is the default test automation framework and is part of the Python standard library.

PyUnit is inspired by JUnit (the xUnit-style framework for Java) and is also compatible with older versions of Python. For pytest, check out our zero-to-hero Selenium pytest tutorial for a quick recap on running pytest with Selenium.

Run the command pip install pytest (or pip3 install pytest) to install the pytest framework on the machine.

 pip install pytest

Explicit waits in Selenium will be leveraged for complex scenarios like waiting for the content to load, navigating through multiple pages, and more. In further sections of this blog on web scraping with Python, I will be using appropriate locators in Selenium along with find_element() / find_elements() methods to scrape content on the document (or web page).

Beautiful Soup

Beautiful Soup is a popular Python library primarily built for web scraping. As specified in the official documentation, Beautiful Soup can navigate and parse through HTML & XML documents. Unlike Selenium, which can be used for static and dynamic web scraping, Beautiful Soup is apt for static web scraping with Python.

It supports unit test discovery using pytest. When writing this blog on web scraping with Python, the latest version of Beautiful Soup is 4.12.2, and the library is commonly referred to as BS4. In case you are using BS3, simply changing the package name from BeautifulSoup to bs4 will help in porting the code to BS4. For more details, you can refer to the official BS4 porting guide.

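When no parser is specified, BS4 picks the best parser installed on the machine. html.parser ships with the Python standard library, whereas the lxml and html5lib libraries can be installed and used as alternatives. Web elements on the HTML page can be located using tag names, attributes, and CSS Selectors [via the select() and select_one() methods]. BS4 does not support XPath natively; XPath-based lookups need a library such as lxml. Akin to the find_element() method in Selenium Python, the find() method in BS4 returns the first element that matches the given filters.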

Similarly, the find_all() method in BS4 (like the find_elements() method in Selenium) scans through the entire page and returns a list of all the descendants that match the filters. You can refer to the official documentation of Beautiful Soup (or BS4) to get insights into all the methods provided by the library.

Before using BS4 for static web scraping, the Requests library (a third-party package, not a part of the Python standard library) is used to make the HTTP request. The text attribute of the HTTP response object, which contains the HTML content, is then parsed using BS4 (e.g., with html.parser). Once parsed, the Beautiful Soup object can be used for navigating & searching through the document’s structure.
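
To make the find() / find_all() flow concrete, here is a minimal sketch that parses a small inline HTML snippet. The markup, class names, and product values are made up for illustration and do not mirror any of the test websites; for a live static page, the HTML string would typically come from requests.get(url).text.

```python
from bs4 import BeautifulSoup

# Inline HTML keeps the sketch runnable offline; the structure is hypothetical.
html = """
<html><body>
  <div class="product"><h4 class="title">Sample Phone</h4><span class="price">$100.00</span></div>
  <div class="product"><h4 class="title">Sample Tablet</h4><span class="price">$250.00</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match (or None); find_all() returns every match.
first_title = soup.find("h4", class_="title").get_text(strip=True)
products = [
    {
        "name": card.find("h4", class_="title").get_text(strip=True),
        "price": card.find("span", class_="price").get_text(strip=True),
    }
    for card in soup.find_all("div", class_="product")
]

# select() takes a CSS selector and also returns a list of matches.
prices = [tag.get_text(strip=True) for tag in soup.select("div.product > span.price")]
print(first_title, products, prices)
```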

BS4 can also be used with Selenium to give wings to dynamic web scraping. In this case, BS4 is used to parse and extract data from the loaded (or rendered) HTML page, whereas Selenium is used to navigate to the relevant pages.

Run the command pip3 install beautifulsoup4 or pip3 install bs4 on the terminal to install Beautiful Soup 4 (or BS4). As seen below, the Beautiful Soup 4.12.2 installation was successful.

pip3 install beautifulsoup4

In further sections of the blog, I will demonstrate the usage of Beautiful Soup and Selenium for scraping content on various pages (1 through 5) on the LambdaTest eCommerce playground.

Playwright

Playwright is another popular test automation framework suited for web scraping. It is primarily used for end-to-end testing of modern web applications. Like Selenium, it also provides APIs that let you perform advanced interactions on the page using headless versions of Chromium-based browsers (e.g., Chrome), Firefox, and WebKit.

Run the command pip3 install playwright (or pip install playwright) to install Playwright for Python on the machine.

pip3 install playwright

When writing this blog on web scraping with Python, the latest version of Playwright is 1.39.0. Auto-waiting in Playwright ensures that actions are performed on elements only after they become actionable. This eliminates the need to add artificial timeouts, making the scraping code much more maintainable & readable.

Over and above, tools like Pyppeteer (an unofficial Python port of Puppeteer) and the scrapy-playwright plugin can also be leveraged for static and dynamic web scraping with Python.

Scrapy

Scrapy is a full-fledged web crawling and scraping framework that can be used to crawl websites to extract structured data from the page(s). Scrapy can be leveraged for data mining, monitoring, and automated testing.

When writing this blog, the latest version of Scrapy is 2.11.0. Run the command pip3 install scrapy (or pip install scrapy) on the terminal to install Scrapy.

pip3 install scrapy

As stated in the Scrapy documentation, Scrapy can also extract data using APIs (such as Amazon Associates Web Services). Though Scrapy is mainly used for static web scraping, it can also be used for scraping dynamic web pages (e.g., QuotestoScrape JS).

Similar to Beautiful Soup, Scrapy, along with frameworks like Selenium or Playwright, can be instrumental in scraping dynamic web content.

In such cases, the scrapy-playwright plugin is handy for scraping dynamic web content using Playwright and Scrapy.

PyQuery, lxml, and Mechanical Soup are other popular libraries for scraping with Python.

In the further sections of the blog on web scraping with Python, the discussion & demonstration will be limited to the following libraries (or tools):

  • Web scraping using Selenium PyUnit
  • Web scraping using Selenium pytest
  • Web scraping using Beautiful Soup

The learnings from the demonstration will be useful for creating sample code for scraping with other Python-based tools. So, let’s get our hands dirty with some code. 🧑‍💻

Demonstration: Web Scraping With Python

For demonstrating web scraping using Selenium PyUnit (& pytest) and Beautiful Soup, I will be using the following websites:

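Framework | Test Websites
Selenium PyUnit (or unittest) | LambdaTest YouTube channel, LambdaTest eCommerce playground
Selenium pytest | LambdaTest YouTube channel, LambdaTest eCommerce playground
Beautiful Soup | LambdaTest eCommerce playground, ScrapingClub (Infinite Scroll)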

Directory Structure

As seen in the project structure, scraping using PyUnit, pytest, and Beautiful Soup is driven via a Makefile. Let’s look at the important directories and files in more detail below:

Here is a closer look at the overall structure:

  • pageobject

    The Page Object Model (POM) design pattern is used for separating web locators from the core test logic. locators.py contains details of the locators used for interacting with the test websites mentioned earlier.

    helpers.py contains the implementation of the wrapper functions for triggering actions, performing explicit waits, and more. Since the use cases for scraping with PyUnit and pytest are the same, two wrapper functions are created for reduced code duplication and improved maintenance:

    Test Method | Purpose
    scrap_ecomm_content | Scraping LambdaTest eCommerce playground
    scrap_yt_content | Scraping LambdaTest YouTube channel

    Both the functions return a list where the scraped information is stored for further operations.

  • tests

    As the name suggests, the tests folder contains the implementation of the core test methods that are responsible for scraping the content using the respective scraping library (or tool).

    To keep the code more modular and maintainable, we have created subfolders (or sub-directories) for the libraries used for scraping the test websites:

    Sub-Folder | File Name | Purpose
    PyUnit | test_ecommerce_scraping.py | LambdaTest eCommerce playground scraping using Selenium & PyUnit
    PyUnit | test_yt_scraping.py | LambdaTest YouTube channel scraping using Selenium & PyUnit
    pytest | test_ecommerce_scraping.py | LambdaTest eCommerce playground scraping using Selenium & pytest
    pytest | test_yt_scraping.py | LambdaTest YouTube channel scraping using Selenium & pytest
    Beautiful Soup | test_ecommerce_scraping.py | LambdaTest eCommerce playground scraping using Beautiful Soup & Selenium
    Beautiful Soup | test_infinite_scraping.py | ScrapingClub (Infinite Scroll) website scraping using Beautiful Soup & Selenium

    We will cover more about the implementation in the code walkthrough section of the blog.

  • Configuration (or setup) files

    The conftest.py file in pytest is used to define and share fixtures, hooks, and other configuration options used in the test code. For demonstration, scraping with Selenium and/or Beautiful Soup is performed using browsers on the local machine and on a cloud grid like LambdaTest.
    LambdaTest is an AI-powered test orchestration and execution platform that lets you perform automated testing using Selenium Python on an online browser farm of 3000+ real browsers and operating systems.

    The scope of the pytest fixture (i.e., @pytest.fixture) is set to function, which means that the fixture is set up and torn down before & after executing a test function (or method). As seen below, execution is controlled using the environment variable EXEC_PLATFORM, which can be set to local or cloud.

    Username (LT_USERNAME) and Access Key (LT_ACCESS_KEY) for using the cloud Selenium Grid on LambdaTest can be obtained by navigating to LambdaTest Profile Page. Set them using the export (for macOS and Linux) or set (for Windows) command to export the environment variables.

    In the teardown function, the quit() method is invoked to terminate the browser session. The JavaScriptExecutor in Selenium [i.e., execute_script()] is used to update the lambda-status variable, which indicates the test execution status on LambdaTest Selenium Grid.

    Similarly, we have created pyunitsetup.py that contains configurations related to the PyUnit (or unittest) framework. Implementation under the __init__() method is used for initializing the test cases.

    Irrespective of whether the tests are executed on a local grid or cloud Selenium grid, the browser is initialized in headless mode. If the environment variable EXEC_PLATFORM is set to cloud, the Chrome browser on the LambdaTest grid is instantiated in headless mode.

    The said capability (i.e., headless), along with the other capabilities, can be set using the LambdaTest Capabilities Generator.

    As stated in the official Selenium documentation, Chrome is instantiated in the headless mode by adding --headless=new to Chrome options.

    Lastly, the setUp() and tearDown() methods are implemented to set up and tear down the test environment. In fact, __init__() is optional, and implementation under it can be moved under the setUp() method.

  • Makefile

    As stated earlier, scraping using the respective libraries (or tools) is controlled via a Makefile. Typing make help provides all the options available for execution.

    The make install command will install the packages/libraries specified in requirements.txt.

    Triggering make clean will clean up the generated folders and .pyc files. Use the below-mentioned commands to scrap test website(s) using the preferred library:

    Make Command | Purpose
    make scrap-using-pyunit | Scrap test websites using PyUnit and Selenium
    make scrap-using-pytest | Scrap test websites using pytest and Selenium
    make scrap-using-beautiful-soup | Scrap test websites using Beautiful Soup (i.e., bs4)

    Scraping will be performed on local grid or cloud grid (on LambdaTest) depending on the value of EXEC_PLATFORM environment variable. I am not getting into the nuances of the Makefile since it is pretty much self-explanatory!

Pre Requisites

First and foremost, trigger virtualenv venv and source venv/bin/activate on the terminal to create the virtual environment.

Since there is a provision to perform scraping on the cloud Selenium grid, it is recommended to have an account on LambdaTest. To scrap websites using Selenium on LambdaTest Grid, you must update LT_USERNAME and LT_ACCESS_KEY (in Makefile) from the LambdaTest Profile Page.

Install the required frameworks and libraries (mentioned in requirements.txt) by triggering the make install command on the terminal.

make install

With this, the stage is finally set to scrap the test websites. Let’s dive in!

Since the scraping logic with Selenium & Python remains the same for PyUnit and pytest, we have combined both frameworks in this section. To begin with, we look into the configuration aspects of each framework, after which we will look into the common scraping logic.

Web Scraping using Selenium and PyUnit

For simplification, all setup-related code (i.e., instantiating the browser, setting explicit timeouts, etc.) is in pyunitsetup.py. Essential aspects of the setup are already discussed in the earlier section of this blog on web scraping with Python.

Let’s deep dive into some of the integral parts of the configuration file pyunitsetup.py

The value of the EXEC_PLATFORM environment variable (i.e., cloud or local) decides whether the instantiation of the browser is on the local machine or cloud grid on LambdaTest.

A Remote WebDriver is instantiated using the webdriver.Remote() method, with the grid URL & browser options passed as parameters. Capabilities (or browser options) can be obtained from LambdaTest Capabilities Generator.

Chrome in headless mode is instantiated as we do not require GUI for web scraping with Python. The w3c flag is set to true since we are using Selenium 4 (which is W3C compliant) for the tests.

When using a local grid (or machine), we set the --headless=new argument for Chrome Options.
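
The platform switch described above can be sketched roughly as follows. This is not the exact code from pyunitsetup.py: the function names (resolve_platform, create_driver), the layout of the LT:Options capability, and the hub address are assumptions for illustration, and the actual capabilities should be taken from the LambdaTest Capabilities Generator.

```python
import os


def resolve_platform(env=None):
    """Return 'cloud' or 'local' based on EXEC_PLATFORM (defaults to 'local')."""
    env = os.environ if env is None else env
    value = env.get("EXEC_PLATFORM", "local").strip().lower()
    return "cloud" if value == "cloud" else "local"


def create_driver(env=None):
    """Create a headless Chrome session on the local machine or a remote grid."""
    # Imported here so that resolve_platform() stays usable without Selenium.
    from selenium import webdriver

    env = os.environ if env is None else env
    options = webdriver.ChromeOptions()

    if resolve_platform(env) == "cloud":
        # Assumed capability layout; generate the real one with the
        # LambdaTest Capabilities Generator.
        options.set_capability("LT:Options", {
            "username": env["LT_USERNAME"],
            "accessKey": env["LT_ACCESS_KEY"],
            "headless": True,
            "w3c": True,
        })
        # Assumed hub address; confirm it in your LambdaTest account.
        return webdriver.Remote(
            command_executor="https://hub.lambdatest.com/wd/hub",
            options=options,
        )

    # Local run: new-style headless Chrome, no GUI required.
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)
```

Keeping the environment lookup in its own small function makes the decision easy to check without launching a browser.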

Finally, the setUp() and tearDown() methods contain the implementation for maximizing the browser and releasing the resources post test execution, respectively.
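
Here is a minimal sketch of that setUp() / tearDown() lifecycle in unittest. To keep it runnable without a browser, it uses a stand-in driver; StubDriver, ScrapingTestBase, and the create_driver() hook are hypothetical names, and a real suite would return a Selenium WebDriver from create_driver().

```python
import unittest


class StubDriver:
    """Stand-in for a Selenium WebDriver so the sketch runs without a browser."""

    def __init__(self):
        self.maximized = False
        self.closed = False

    def maximize_window(self):
        self.maximized = True

    def quit(self):
        self.closed = True


class ScrapingTestBase(unittest.TestCase):
    """Hypothetical base class that gives every test method a fresh browser."""

    def create_driver(self):
        # A real suite would return webdriver.Chrome(...) or webdriver.Remote(...).
        return StubDriver()

    def setUp(self):
        # Runs before each test method.
        self.driver = self.create_driver()
        self.driver.maximize_window()

    def tearDown(self):
        # Runs after each test method (pass or fail) once setUp() has succeeded.
        self.driver.quit()


class TestLifecycle(ScrapingTestBase):
    def test_browser_is_ready(self):
        self.assertTrue(self.driver.maximized)
        self.assertFalse(self.driver.closed)
```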

Web Scraping using Selenium and pytest

Since the demo websites scraped with PyUnit and pytest are the same, there are changes only in the setup-related code. Like web scraping with PyUnit, all setup-related code (i.e., instantiating the browser, setting explicit timeouts, etc.) is segregated in conftest.py.

For starters, conftest.py is a special file in pytest that contains the configuration of the test suite. Hence, hooks, fixtures, and other configurations are all a part of conftest.py

Let’s deep dive into some of the integral parts of the configuration file conftest.py

The @pytest.fixture(scope='function') decorator is used for defining a fixture with a function scope.

Depending on the value of the EXEC_PLATFORM environment variable (i.e. cloud or local), the browser is instantiated on the local machine or cloud grid on LambdaTest.

In case you are planning to run the Selenium Python tests (i.e., scrap website under test) on browsers in LambdaTest cloud, instantiate Remote WebDriver with grid URL & browser options as parameters to the webdriver.Remote() method. Capabilities (or browser options) can be obtained from LambdaTest Capabilities Generator.

Since we are performing web scraping with Python, the browser (i.e., Chrome) is instantiated in the headless mode. Headless Chrome is faster than the real browser (with the GUI). Hence, it is best suited for web scraping with Python. Since Selenium 4 (W3C compliant) is used for testing, the w3c flag is set to true.

For scraping using Selenium on local headless browsers (e.g., Chrome) on a local grid (or machine), simply set the --headless=new argument for Chrome Options.

Lastly, the yield statement hands the resource (i.e., browser) over to the test. The code before yield acts as the setup, and the code after yield acts as the teardown. Hence, all the resources the browser uses will be released once the execution is complete.
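
A minimal sketch of such a yield-based fixture is shown below. It is not the exact conftest.py from the project: the helper names driver_lifecycle and _headless_chrome are hypothetical, and the setup/teardown part is split into a plain generator only so that it can be exercised without pytest or a browser.

```python
import pytest


def driver_lifecycle(create_driver):
    """Set up a driver, hand it over, and tear it down afterwards."""
    driver = create_driver()  # setup: runs before the test body
    try:
        yield driver          # the test receives this object
    finally:
        driver.quit()         # teardown: runs after the test, pass or fail


def _headless_chrome():
    # Imported lazily so this module can be imported without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


@pytest.fixture(scope="function")
def driver():
    # Function scope: a fresh browser is created and closed for every test.
    yield from driver_lifecycle(_headless_chrome)
```

A test opts in either by taking driver as an argument or through the usefixtures marker.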

With this, we are all set to dive deep into the scraping implementation!

Test Scenario 1 – Scraping LambdaTest YouTube Channel

In this example, meta-data (i.e., title, views, duration, etc.) associated with LambdaTest YouTube Videos is scraped using the PyUnit (or unittest) framework.

Since the LambdaTest YouTube channel has close to 600 videos, a vertical scroll (till the end of the page) is performed so that the required information can be scraped from those videos!

Implementation (PyUnit)

The file test_yt_scraping.py contains the test logic where the scrap_youtube() method scraps the channel. It returns a list that contains the metadata of the videos.

To get started, we create an object of the pyunit_setup class which contains implementation of setUp() and tearDown() methods.

The browser instantiation is done in the __init__() method implemented in pyunitsetup.py. An empty list named meta_data_arr is created, and the browser is navigated to the URL under test.

Scraping YouTube content is legal as long as we comply with regulations that deal with personal data and copyright protection. After repeated scraping, we observed that YouTube showed the Consent Page, where we had to programmatically click the Accept all button to navigate to the LambdaTest YouTube channel.

Since the above consent page was shown intermittently, we used a try…except block to counter the above situation. We first locate the Accept all button using the CSS Selector form:nth-child(3) > div > div > button[aria-label='Accept all']. Alternatively, we could have also located the element using the XPath locator //div[@class='csJmFc']/form[2]//div[@class='VfPpkd-RLmnJb'].

To locate elements with ease, we used the POM Builder that provides CSS Selector & XPath of elements at the click of a button. The WebDriverWait class is used along with the element_to_be_clickable expected condition in Selenium to realize an explicit wait for 5 seconds.

Once the element is located, the click() method in Selenium is invoked to perform a click operation on the Accept all button.

In case the YouTube Consent Page does not show up, we simply handle the exception in the except block and print the exception.

Now that we are on the LambdaTest YouTube Channel, we invoke the scrap_yt_content() method that is a part of the helpers class. We will look into its implementation in the further sections of the blog.

Upon successful scraping, a list meta_data_arr containing the scraped content is returned, and the content is printed on the terminal using the print_scrapped_content() method.

Implementation (pytest)

Like PyUnit, the core logic for scraping the YouTube channel is implemented in the scrap_youtube() method in test_yt_scraping.py. It returns a list that contains the metadata of the videos.

Since the majority of the implementation remains the same, we will focus only on the changes that are specific to the pytest framework.

The @pytest.mark.usefixtures('driver') decorator indicates that the driver (i.e., instantiated Chrome browser) fixture must be used in the execution of the test methods.

As mentioned in the earlier section, a click on the Accept all button is triggered in case the YouTube consent form is displayed on the screen.

Once you are on the LambdaTest YouTube channel, the scrap_yt_content() method is invoked to scrap the channel content. Lastly, the scraped content is printed on the screen.

Since the rest of the implementation is the same as in the PyUnit framework, we are not touching upon those aspects in this section.

Implementation (Core Scraping Logic)

The helpers.py file contains the core implementation for scraping the LambdaTest YouTube channel.

Code Walkthrough

To get started, we wait for a maximum duration of 10 seconds till the document (or page) is fully loaded (i.e., document.readyState is equal to complete). Feel free to check out the detailed DOM tutorial if you want a quick refresher about the nuances of DOM.
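
With Selenium, this kind of wait is normally expressed with WebDriverWait. To show what happens under the hood, the sketch below spells out an equivalent polling loop using only the standard library, so it can be tried without a browser. The function name wait_for_page_load and the poll interval are assumptions for illustration; the only thing it needs from the driver is execute_script().

```python
import time


def wait_for_page_load(driver, timeout=10.0, poll=0.5):
    """Poll document.readyState until it is 'complete' or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError(f"page did not finish loading within {timeout} seconds")
        time.sleep(poll)
```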

Once the page is loaded, we first get the scrollable height of the document using the JavaScript function document.documentElement.scrollHeight. Before doing so, we maximize the browser window since that is considered one of the best practices in Selenium automation.

Next, we run a while loop where a vertical scroll is performed using the window.scrollTo() method with document.documentElement.scrollHeight as the input argument.

In order to see the window.scrollTo() method in action, open the LambdaTest YouTube channel on your browser. Then trigger the following commands in the browser console:

  • document.documentElement.scrollHeight
  • window.scrollTo(0, document.documentElement.scrollHeight)

In the implementation shown above, we perform the above two actions till the end of the page is reached. This is when we break from the while loop!
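
The scroll-till-the-end logic can be sketched as follows. This is a simplified version and not the exact code in helpers.py: the function name scroll_to_end, the pause duration, and the max_rounds safety cap are assumptions for illustration. The driver only needs to expose execute_script(), which is what Selenium's WebDriver provides.

```python
import time


def scroll_to_end(driver, pause=2.0, max_rounds=100):
    """Scroll until the document height stops growing; return the final height."""
    get_height = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(get_height)
    for _ in range(max_rounds):
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);"
        )
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, so the end of the page is reached
        last_height = new_height
    return last_height
```

The fixed pause is the simplest option; on slow connections it may need to be increased so that the loop does not stop before the next batch of videos is loaded.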

The video meta-data (i.e., name, views, duration, etc.) is present in the div with ID details, which is nested under another div with ID dismissible. The said element(s) are located using the CSS Selector #dismissible > #details.

As seen in the Inspect Tools screenshot, there are 581 entries (or elements) when searched with the CSS Selector #dismissible > #details. In summary, the LambdaTest YouTube channel had 581 videos when writing this blog on web scraping with Python.

Next up, we locate the element(s) using CSS Selector #meta, which is nested under the div #details.

Now that we have located the #meta element, we procure the video title, which is nested inside it. The title of the video is located using the #video-title CSS Selector.

As seen below, the innerText attribute of the element (CSS Selector #video-title) provides the title of the video!

The next task is to find the video views, which are located using the nested CSS Selector #metadata > #metadata-line > span:nth-child(3)

As seen in the Inspect Tools screenshot, #metadata-line is nested inside #metadata (i.e., Element with ID #metadata-line is a child of an element with ID #metadata). To find the element’s locator that gives video views, simply hover over the views (of the first video) and copy the selector from the Inspect tools in the browser.

The selector span:nth-child(3) matches a span element that is the third child of its parent (i.e., the element with the ID metadata-line). Hence, the element located using the CSS Selector #metadata > #metadata-line > span:nth-child(3) provides the views of the video.
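
To see how nth-child counts positions, here is a small offline sketch that runs the same selectors with Beautiful Soup. The markup is a made-up simplification and not YouTube's actual DOM: two non-span children are placed before the spans so that the views land in the third position.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: two non-span children precede the spans.
html = """
<div id="metadata">
  <div id="metadata-line">
    <div class="separator"></div>
    <div class="badge"></div>
    <span>1.2K views</span>
    <span>2 weeks ago</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
views = soup.select_one("#metadata > #metadata-line > span:nth-child(3)")
published = soup.select_one("#metadata > #metadata-line > span:nth-child(4)")
# The first child is a div and not a span, so this selector matches nothing.
missing = soup.select_one("#metadata > #metadata-line > span:nth-child(1)")

print(views.get_text(), published.get_text(), missing)
```

The position is counted across all child elements, not just the spans, which is why the first span is matched by nth-child(3) in this snippet.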

The innerText attribute of the located element provides the number of views of the said video.

On similar lines, the publishing date of the respective video is also procured using the innerText attribute of the element located using the CSS Selector #metadata > #metadata-line > span:nth-child(4).

All the above-mentioned steps are executed in a for loop that iterates through the elements that were located using driver.find_elements(By.CSS_SELECTOR, "#dismissible > #details").
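
Putting the locators together, the loop can be sketched as shown below. It is a simplified version and not the exact code in helpers.py: the function names and the dictionary keys are assumptions for illustration. The locator strategy is written as the plain string "css selector", which is the value behind Selenium's By.CSS_SELECTOR, so the sketch has no import-time dependency on Selenium.

```python
CSS = "css selector"  # the string value behind selenium's By.CSS_SELECTOR


def inner_text(parent, selector):
    """Return the trimmed innerText of the first descendant matching selector."""
    return parent.find_element(CSS, selector).get_attribute("innerText").strip()


def collect_video_metadata(driver):
    """Build one dict per video from the rendered channel page."""
    meta_data_arr = []
    for details in driver.find_elements(CSS, "#dismissible > #details"):
        meta = details.find_element(CSS, "#meta")
        meta_data_arr.append({
            "title": inner_text(meta, "#video-title"),
            "views": inner_text(meta, "#metadata > #metadata-line > span:nth-child(3)"),
            "published": inner_text(meta, "#metadata > #metadata-line > span:nth-child(4)"),
        })
    return meta_data_arr
```

In the project code, By.CSS_SELECTOR from selenium.webdriver.common.by would be used in place of the CSS constant.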

Test Scenario 2 – Scraping LambdaTest eCommerce Playground

In this example, we scrap the product meta-data (i.e., product name and price) from LambdaTest eCommerce Playground. As mentioned earlier, the scraping logic works seamlessly on the local grid and cloud Selenium grid like LambdaTest.

Implementation (Core Scraping Logic)

Since the overall setup & configuration-related implementation remain unchanged, we would directly deep dive into the scraping implementation.

Framework | Configuration/Setup File
PyUnit | pyunit/test_ecommerce_scraping.py
pytest | pytest/test_ecommerce_scraping.py

Akin to the YouTube test scenario, the helpers.py file contains the core implementation where the scrap_ecomm_content() method is responsible for scraping the eCommerce Playground. It returns a list that contains the metadata of the SKUs on the page.

Code Walkthrough

Let’s directly look into the major aspects of the scrap_ecomm_content() method!

First, we create an instance of the ActionChains class to perform mouse movement operations on the page. We suggest having a look at our detailed blog on ActionChains in Selenium Python in case you are new to ActionChains or want to have a quick refresher of this important concept!

WebDriverWait and expected_conditions in Selenium are used to initiate an explicit wait until the specified element is visible. The element in question is the menu item that is located using the XPath "//a[contains(.,'Shop by Category')]".

Once the menu is visible, the move_to_element() method of the ActionChains class is used to move (or focus) on the located menu item. Since only one operation (i.e., click) is involved in the chain, we invoke the click() and perform() methods to trigger the required action.

Now that we are on the Menu, the menu item containing Phone, Tablets & Ipod is located using an XPath selector. Once the menu item is located, the combination of move_to_element() and click operations is performed to open the Product Page.

Now that we are on the Product Page, we wait for the visibility of the grid containing the respective products. By default, 15 products are displayed on a single page.

Nested XPath locator //div[@id='entry_212391']//div[@id='entry_212408']//div[@class='row'] is used for locating the grid containing the products. Alternatively, you can also use the CSS Selector to locate the grid.

Simply right-click on the row and choose Copy → Copy CSS Selector to locate the grid using CSS Selector. You can check out our XPath vs CSS Selectors guide to get more insights into the advanced features of these selectors.

Now that we have located the grid (containing the 15 products), the next step is to locate every row housing the corresponding product. As shown below, we have <div>’s for each nested product under the locator used in the previous step.

The find_elements() method in Selenium is used along with the Class Name selector to find the elements containing product(s) information. Since there are 15 products on the page, the length of the list returned by the find_elements() method will also be 15!

Next, we run a for loop and iterate through the list to scrap details of every product in the list. Since there are 15 products on the page, the loop will be executed 15 times.

As seen below, the element(s) located using CSS locator div.product-thumb > div.caption provides the meta-data (i.e., product name & price) of every product on the page.

With the meta-data handy, the next step is to find the Product Name from the element located in the previous step.

As seen from the DOM layout, the element located using nested CSS Selector .title .text-ellipsis-2 provides product details.

The text attribute of the element provides the Product Name of the respective product.

On similar lines, the href attribute of the located element provides the link to the product (e.g., Product Link).

Like the earlier step, the price of the current product is scraped by locating the element using the nested CSS Selector .price .price-new.

Now that the element is located, the text attribute of the element gives the price of the product.

Finally, the scrap_ecomm_content() method returns a list that contains the scraped metadata of every product on the page under test!

Execution (PyUnit + Selenium)

As stated earlier, we can use the browser on a local machine (or grid) and cloud Selenium grid.

Set environment variable EXEC_PLATFORM to local for using the Chrome browser (headless mode) with Selenium for web scraping. Invoke the command make scrap-using-pyunit to start scraping content from the test website(s).

As shown below, the Accept all button does not appear when the scraping is done on a local machine.

On the other hand, the Accept all button is seen when the scraping is done using Chrome (headless mode) on cloud Selenium Grid.

As seen below, the content from the LambdaTest YouTube channel and LambdaTest eCommerce playground was scraped successfully.

Set environment variable EXEC_PLATFORM to cloud for using Chrome browser (headless mode) on cloud Selenium Grid.

Log on to the LambdaTest Automation Dashboard to view the status of the test execution.

As seen below, the test execution on LambdaTest was successful!

Execution (pytest + Selenium)

Like earlier, set EXEC_PLATFORM to local for using Chrome browser (headless mode) with Selenium for web scraping with Python. Invoke the command make scrap-using-pytest to start scraping content from the test website(s).

As seen below, scraping data from the test websites was successful.

Set environment variable EXEC_PLATFORM to cloud and invoke make scrap-using-pytest for using Chrome browser (headless mode) on cloud Selenium Grid.

Shown below is the status on the dashboard, which indicates that web scraping using Selenium and pytest was successful.

Web Scraping using Beautiful Soup and Selenium

For a demonstration of web scraping with Python using Beautiful Soup and Selenium, we would be scraping content from the following websites:

Test Website | Framework (or library)
LambdaTest eCommerce Product Page | Beautiful Soup
ScrapingClub (Infinite Scroll) | Beautiful Soup + Selenium

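We are using the combination of Beautiful Soup and Selenium for a very specific reason.🙂 The ScrapingClub page loads its content dynamically (i.e., on scroll), which is where Selenium has the upper hand over Beautiful Soup. The paginated eCommerce product pages, on the other hand, can be fetched with plain HTTP requests.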

Selenium is used for performing actions (e.g., scrolling, clicking, etc.) on the page (or document) so that the content to be scraped is available for our perusal. On the other hand, Beautiful Soup is used for parsing and navigating the structure of the HTML page.

The prerequisites and directory structure are already covered in the earlier sections of this blog on web scraping with Python. BeautifulSoup4 (or bs4) & Selenium (v 4.13.0) are used for scraping the ScrapingClub website, whereas bs4, along with Requests (v 2.31.0) help in scraping the eCommerce playground product page(s).

Demonstration: Scraping eCommerce Playground

In this test scenario, we would scrap the products from the eCommerce Playground product page. The website uses pagination, and 15 products are displayed on a single page.

  1. Navigate to the eCommerce Playground product page.
  2. Send an HTTP GET request to the URL using the Requests library.
  3. Using bs4, scrap product information from all the products of the said category.

Implementation

Similar to the configuration for the PyUnit & pytest frameworks, the locator & URL details are present in locators.py.

The core scraping logic is implemented in test_ecommerce_scraping.py. Let’s dive deep into the same!

Code Walkthrough

Inside the __main__ construct, we set up the test URL, which constitutes the base URL appended by the page number (i.e. &page=< page-number >).

Since there are 5 pages in total for the said product category, the scraping implementation is executed in an iterative loop over pages 1 to 5 (with Python's range(), that is range(1, 6), since the upper bound is exclusive).
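
As a quick illustration of how the paginated URLs can be put together, consider the sketch below. BASE_URL is a hypothetical placeholder (the real value lives in locators.py), and build_page_urls() is a helper name introduced only for this example.

```python
# Hypothetical placeholder; the actual base URL is defined in locators.py.
BASE_URL = "https://example.com/index.php?route=product/category&path=57"
TOTAL_PAGES = 5


def build_page_urls(base_url=BASE_URL, total_pages=TOTAL_PAGES):
    """Return one URL per product page by appending &page=<page-number>."""
    # range() excludes its upper bound, hence total_pages + 1.
    return [f"{base_url}&page={page}" for page in range(1, total_pages + 1)]


if __name__ == "__main__":
    for url in build_page_urls():
        print(url)
```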

To get started with scraping, an HTTP GET request is initiated using the get() method of the requests library. The method takes the URL under test as the input parameter and returns a response object.

The status code (i.e., status_code) attribute of the response object is compared with 200 (i.e., STATUS_OK). If the status_code is not 200 (e.g., 404, 403, etc.), the test is terminated.
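
A minimal sketch of the request and the status check is shown below. The helper name fetch_html(), the timeout value, and the choice of raising an exception (instead of terminating the test in some other way) are assumptions made for this example.

```python
import requests

STATUS_OK = 200


def fetch_html(url, timeout=10):
    """Send an HTTP GET request and return the page HTML if the status is 200."""
    response = requests.get(url, timeout=timeout)
    if response.status_code != STATUS_OK:
        # Assumption: we stop by raising; the original test simply terminates.
        raise RuntimeError(f"Unable to fetch {url}: HTTP {response.status_code}")
    return response.text
```

Passing a timeout is optional, but it keeps the script from hanging indefinitely if the server does not respond.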

The response.text attribute contains the HTML content of the page, which is retrieved as a string. As stated earlier, html.parser (i.e., the parser from Python's standard library, which bs4 supports without any extra installation) is used to parse the HTML content.

BeautifulSoup(response.text, 'html.parser') returns a BeautifulSoup object that will be used throughout scraping. The select() method of Beautiful Soup finds elements using the CSS Selector .product-layout.product-grid.no-desc.col-xl-4.col-lg-4.col-md-4.col-sm-6.col-6

As expected, 15 elements match the CSS Selector since there are 15 products on the page. The select() method returns a list that will be further used for iteratively scraping information (i.e., name, cost) of each product.

Next up, we run a for loop for scraping information of all the 15 products (or elements) under the div located in the previous step.

The product link is obtained by locating the element using the find() method of Beautiful Soup. As seen in the Inspect Tools screenshot above, the first argument is the tag that needs to be searched for (i.e., “a” – anchor tag), and the second is the CSS Class attribute.

Link to the product is procured by using the get_attribute_list() method with the located element. In case you need more information on the find() method of Beautiful Soup, we recommend checking out the Beautiful Soup Official Documentation.

On similar lines, the product name is also scraped by using the find() method in Beautiful Soup. In this case, the first argument is the “h4” tag, and the second argument is the Class Name locator “title”.

Finally, the get_text() method of Beautiful Soup, when used with the recently located element, provides the product name of the current product.

The last step is to scrap the price of the respective product. As shown below, the text under the element with Class Name “price-new” provides the product price.

Like before, the find() method of Beautiful Soup is used with “span” (i.e., tag to be located) as the first argument and Class Name “price-new” as the second argument. The get_text() method returns the text inside the element.

Finally, a dictionary (i.e., meta_data_dict) of the scraped information (i.e., product link, name, and price) is created, and the same is appended to the list (i.e., meta_data_arr).
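
The parsing steps above can be tried offline with a small, made-up HTML snippet. In the sketch below, SAMPLE_HTML, the shortened selector .product-layout.product-grid, the anchor class product-link, the dictionary keys, and the function name parse_products() are all hypothetical; only the Beautiful Soup calls mirror the walkthrough.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure of a product grid.
SAMPLE_HTML = """
<div class="product-layout product-grid">
  <a class="product-link" href="https://example.com/product/1"></a>
  <h4 class="title"> Sample Phone </h4>
  <span class="price-new">$100.00</span>
</div>
<div class="product-layout product-grid">
  <a class="product-link" href="https://example.com/product/2"></a>
  <h4 class="title"> Sample Tablet </h4>
  <span class="price-new">$250.00</span>
</div>
"""


def parse_products(html):
    """Return a list with one dictionary (link, name, price) per product."""
    soup = BeautifulSoup(html, "html.parser")
    meta_data_arr = []
    for product in soup.select(".product-layout.product-grid"):
        link = product.find("a", class_="product-link")
        name = product.find("h4", class_="title")
        price = product.find("span", class_="price-new")
        meta_data_dict = {
            # get_attribute_list() always returns a list, so take the first item.
            "product link": link.get_attribute_list("href")[0],
            "product name": name.get_text(strip=True),
            "product price": price.get_text(strip=True),
        }
        meta_data_arr.append(meta_data_dict)
    return meta_data_arr


if __name__ == "__main__":
    for item in parse_products(SAMPLE_HTML):
        print(item)
```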

The same steps are repeated for all the other product pages (i.e., Product Page 1 to Product Page 5), and the scraped data is displayed on the screen.

Demonstration: Scraping Infinite Scrolling website

In this test scenario, we would be scraping the ScrapingClub website that has content loaded on a dynamic basis. Since the website involves dynamic content, the combination of Selenium & Beautiful Soup is used for web page interactions & scraping.

  1. Navigate to Infinite Scroll Page.
  2. Scroll till the end of the page.
  3. Scrap product content using Beautiful Soup.

Implementation

Akin to the PyUnit & pytest frameworks, the locator & URL details are present in locators.py

The core scraping logic is implemented in test_infinite_scraping.py. Let’s deep dive into the same!

Code Walkthrough

Inside the __main__ construct, we invoke the scrap_inifinite_website() method that is primarily responsible for scraping the website under test.

Since we are using Selenium for performing actions on the page, the EXEC_PLATFORM environment variable helps us choose between local grid (or machine) and the cloud Selenium Grid.

We have already covered Remote WebDriver and Headless browser testing in the earlier sections of the blog.
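
Reading the EXEC_PLATFORM switch can be as simple as the sketch below. The helper name resolve_exec_platform(), the default of local, and the case-insensitive comparison are assumptions for illustration and may differ from the actual implementation.

```python
import os

VALID_PLATFORMS = ("local", "cloud")


def resolve_exec_platform(environ=os.environ):
    """Return 'local' or 'cloud' based on the EXEC_PLATFORM environment variable."""
    # Assumption: fall back to the local browser when the variable is not set.
    platform = environ.get("EXEC_PLATFORM", "local").strip().lower()
    if platform not in VALID_PLATFORMS:
        raise ValueError(
            f"EXEC_PLATFORM must be one of {VALID_PLATFORMS}, got {platform!r}")
    return platform
```

The returned value can then decide whether a local headless Chrome instance or a Remote WebDriver session on the cloud grid is created.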

Since the content is loaded dynamically (i.e., on page scroll), the page is scrolled till the end by calling the execute_script() method with a script that uses document.documentElement.scrollHeight as the scroll target.
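
One common way to scroll an infinite-scroll page is to keep scrolling until the page height stops growing. The sketch below follows that idea. The loop, the pause, and the max_rounds safety limit are assumptions on our part, since the walkthrough only mentions execute_script() and document.documentElement.scrollHeight.

```python
import time

GET_HEIGHT = "return document.documentElement.scrollHeight"
SCROLL_TO_END = "window.scrollTo(0, document.documentElement.scrollHeight);"


def scroll_to_end(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops changing; return the final height."""
    last_height = driver.execute_script(GET_HEIGHT)
    for _ in range(max_rounds):
        driver.execute_script(SCROLL_TO_END)
        # Give the page some time to load the next batch of content.
        time.sleep(pause)
        new_height = driver.execute_script(GET_HEIGHT)
        if new_height == last_height:
            break  # No new content was loaded, so we reached the end.
        last_height = new_height
    return last_height
```

A fixed pause is the simplest option. An explicit wait on newly loaded elements would be more robust on slow connections.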

Content & Images loaded on a dynamic basis

Now that the page load is complete, create a Beautiful Soup object that parses the page source obtained using Selenium. You can check out our detailed blog that deep dives into getting page sources with Selenium WebDriver for more information on retrieving page sources using Selenium.

Since we already have the page source, it’s safe to release resources used by the instantiated browser instance.
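
The hand-off from Selenium to Beautiful Soup can be sketched in a few lines. The helper name soup_from_driver() is hypothetical. The sketch assumes the page has already been scrolled to the end.

```python
from bs4 import BeautifulSoup


def soup_from_driver(driver):
    """Parse the current page source and release the browser."""
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    # The HTML is already in memory, so the browser is no longer needed.
    driver.quit()
    return soup
```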

Next, the find_all() method of Beautiful Soup is used with “div” (i.e., tag to be located) as the first argument and Class Name “w-full rounded border post” as the second argument. It returns a list of elements that meet the specified conditions.

As seen from the Inspect Tools screenshot, the length of the list returned by find_all() method should be 60 (i.e. 60 products are present on the page).

Next up, we use the for loop for iteratively scraping information (i.e. dress name, link, price) of each product.

The find() method of Beautiful Soup is used to locate the element with input as “h4” tag.

The text attribute of the located element provides the product name (e.g., Short Dress). The leading & trailing spaces of the output are removed using the strip() method.

The “href” attribute of the element located earlier using the “h4” tag [i.e., dress = row.find('h4')] gives the product link.

On similar lines, the price of the respective product is obtained by locating the element using the find() method. The method takes the “h5” tag as the input argument since the text attribute of the element contains the product price.

Like the earlier step, leading & trailing spaces are stripped from the output of the text attribute to get the final price of the respective product.
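
The per-product parsing for the infinite-scroll page can be tried offline as well. In the sketch below, SAMPLE_HTML is made up (including the link and the price), the function name and dictionary keys are hypothetical, and we assume that the product link sits on an anchor nested inside the h4 element.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics a single product card.
SAMPLE_HTML = """
<div class="w-full rounded border post">
  <h4><a href="/items/short-dress/"> Short Dress </a></h4>
  <h5> $24.99 </h5>
</div>
"""


def parse_dresses(html):
    """Return name, link, and price for every product card in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("div", class_="w-full rounded border post")
    items = []
    for row in rows:
        dress = row.find("h4")
        price = row.find("h5")
        items.append({
            "name": dress.text.strip(),
            # Assumption: the href lives on the anchor nested in the h4 tag.
            "link": dress.find("a")["href"],
            "price": price.text.strip(),
        })
    return items


if __name__ == "__main__":
    print(parse_dresses(SAMPLE_HTML))
```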

With this, we have successfully scraped the meta-data of products from the ScrapingClub (Infinite Scrolling) website. Let’s trigger the same on the local as well as LambdaTest Cloud Selenium grid.

Execution

As stated earlier, we can instantiate the Chrome browser (in headless mode) on a local machine (or grid) as well as cloud Selenium grid.

Set environment variable EXEC_PLATFORM to local for using the Chrome browser (headless mode) on the local machine. Invoke the command make scrap-using-beautiful-soup to start scraping content from the test website(s).

As seen below, we are able to scrap content from the LambdaTest E-commerce Playground Product Page(s) and ScrapingClub website.

Set environment variable EXEC_PLATFORM to cloud for using Cloud Selenium Grid on LambdaTest.

Shown below is the status on the dashboard, which indicates that web scraping with Python using Beautiful Soup and Selenium was successful.

With this, we have covered web scraping with Python using popular libraries (or tools) like Selenium (with PyUnit & pytest) and Beautiful Soup. The popularly-quoted Spider-Man theme of “With great power, comes great responsibility” also applies to web scraping!

This is because web scraping with Python must be performed while taking legal and ethical considerations into account. It is unethical to scrape confidential (or sensitive) information from any website.

Web Scraping Done Right!

As stated, web scraping is a super-powerful tool to have in your data collection armory. However, its benefits should be leveraged with utmost caution! Python provides numerous tools (or libraries) for web scraping, making it one of the preferred languages for scraping web content.

Though the choice of web scraping tool purely depends on the requirements, Selenium can be preferred for scraping dynamic web content. On the other hand, Beautiful Soup is one of the preferred choices for scraping static web content.

Playwright is another popular framework for scraping web content. You can refer to our detailed blog on Web Scraping with Playwright and Python in case you intend to leverage the framework for scraping.

As a part of the web scraping series, I dabbled into Web Scraping with C# using Selenium. To summarize, web scraping is a great tool that helps make the most of public data, but it should be used with utmost responsibility.🙂

Happy Scraping!

Frequently Asked Questions (FAQs)

Is Python good for web scraping?

Yes, Python is widely recognized as an excellent programming language for web scraping. Python offers several libraries and frameworks, such as BeautifulSoup and Scrapy, that simplify the web scraping process. Its simplicity, readability, and rich ecosystem of packages make it a popular choice for web scraping tasks. Additionally, Python is platform-independent, which means you can use it on various operating systems to collect data from websites efficiently.

What is web scraping with Python?

Web scraping with Python refers to automatically extracting data from websites. Python provides various libraries and frameworks, such as BeautifulSoup, Scrapy, and Requests, that enable developers to write scripts to access web pages, parse the HTML content, and extract specific information or data. This data can be used for various purposes, including data analysis, research, content aggregation, or any other application where web data is needed. Web scraping with Python is a versatile and powerful technique for collecting information from the internet.

Is it legal to web scrape?

The legality of web scraping depends on various factors, including the website’s terms of service, the nature of the data being collected, and local or international laws. Here are some key points to consider:

  • Website’s Terms of Service: Websites often have service or usage policies that specify whether web scraping is allowed or prohibited. It’s essential to review and comply with these terms. Violating a website’s terms of service could lead to legal issues.
  • Publicly Accessible Data: Web scraping publicly accessible data, such as information available without logging in, is typically considered more acceptable. However, this does not grant unlimited rights to use or redistribute the data.
  • Copyright and Intellectual Property: If the data being scraped is protected by copyright or contains intellectual property, scraping and using that data may infringe on those rights.
  • Personal Data and Privacy: Scraping and using personal data without consent can raise privacy and legal concerns, particularly under data protection laws like the GDPR in the European Union.
  • Anti-Competitive Behavior: Web scraping for anti-competitive purposes, like price scraping to undercut competitors unfairly, may be subject to legal action under antitrust laws.
  • Rate Limiting and Politeness: It’s advisable to implement rate limiting and be polite in your scraping activities. Excessive requests can strain a website’s servers and may be viewed as a denial-of-service attack.
  • Laws and Jurisdictions: The legal aspects of web scraping can vary by jurisdiction. What’s legal in one country may not be in another.

To ensure compliance with laws and ethical standards, it’s essential to consult with legal experts if you have concerns about the legality of your web scraping activities, particularly when dealing with sensitive or proprietary data. Additionally, always respect a website’s terms of service and robots.txt file, which can specify rules for web crawlers.
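
On the technical side, checking robots.txt before scraping a URL can be sketched with Python's standard library. The helper name is_allowed(), the sample rules, and the user agent my-scraper are made up for this example. Passing such a check does not settle the legal questions listed above.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/page"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/public/page"))
```

For a live website, RobotFileParser can also download the file via set_url() and read(). Adding a short pause between requests (e.g., time.sleep()) is a simple way to stay polite.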

Author’s Profile

Himanshu Sheth

Himanshu Sheth is a seasoned technologist and blogger with 15+ years of diverse working experience. He currently works as the 'Lead Developer Evangelist' and 'Senior Manager [Technical Content Marketing]' at LambdaTest. He is very active with the startup community in Bengaluru (and down South) and loves interacting with passionate founders on his personal blog (which he has been maintaining for the last 15+ years).
