Web Scraping is a method employed to gather substantial amounts of data from various websites. The phrase "scraping" signifies the process of retrieving information from external sources (such as web pages) and storing it in a local file. For instance, imagine you are developing a project named "Mobile Phone Comparison Website," where you need to collect the prices, ratings, and model names of mobile phones to facilitate comparisons among different devices.
Gathering this information by manually visiting multiple websites can be quite time-consuming. In such scenarios, web scraping becomes invaluable, as it allows you to obtain the required data by simply writing a few lines of code.
Web scraping involves the retrieval of data from websites that is presented in an unstructured format. This process facilitates the gathering of such unstructured information and transforms it into a structured format.
Startups favor web scraping because it is an economical and efficient way to acquire large volumes of data without having to partner with the companies that sell it.
Is Web Scraping legal?
In this context, a pertinent question emerges regarding the legality of web scraping. The response is that certain websites permit it when conducted within legal parameters. Web scraping serves merely as a mechanism; it can be utilized appropriately or improperly, depending on the circumstances.
Extracting data from the web can be considered unlawful when it involves scraping nonpublic information. Nonpublic data refers to information that is not available to the general public; attempting to retrieve such data constitutes a breach of legal regulations.
There are several tools available to scrape data from websites, such as:
- Scraping-bot
- Scraper API
- Octoparse
- Import.io
- Webhose.io
- Dexi.io
- Outwit
- Diffbot
- Content Grabber
- Mozenda
- Web Scraper Chrome Extension
Why Web Scraping?
As previously mentioned, web scraping serves the purpose of retrieving data from online sources. However, it is essential to understand how to effectively utilize that unprocessed data. This raw information can be applied across a multitude of domains. Let’s explore the applications of web scraping:
Dynamic Price Monitoring
It is commonly employed to gather data from e-commerce platforms so that product prices can be compared and advantageous pricing strategies formulated. Using web-scraped data for price monitoring keeps businesses informed about market trends and supports dynamic pricing, helping companies maintain a competitive edge over their rivals.
Market Research
Web Scraping serves as an excellent tool for analyzing market trends. It allows for the extraction of valuable insights related to a specific market. Large enterprises often require extensive data sets, and web scraping ensures that this data is collected with a high degree of reliability and precision.
Email Gathering
Numerous organizations leverage personal email information for the purposes of email marketing. This enables them to focus on particular segments of their audience for their promotional efforts.
News and Content Monitoring
A single news cycle can create a significant opportunity for your business or pose a real threat to it. If your organization depends on timely news analysis, or frequently appears in the news itself, web scraping is an ideal way to monitor and analyze the most important stories. News reports and social media posts can directly affect market movements.
Social Media Scraping
Web scraping serves a crucial function in the process of gathering data from social media platforms like Twitter, Facebook, and Instagram, enabling users to identify current trending topics.
Research and Development
An extensive array of data, encompassing general information, statistical figures, and temperature readings, is extracted from websites. This data is then examined and utilized for conducting surveys or for research and development purposes.
Why use Python for Web Scraping?
While there are numerous well-known programming languages available, what are the reasons for selecting Python over others for web scraping tasks? In the following section, we will outline a series of features that position Python as the most advantageous programming language for web scraping activities.
Dynamically Typed
In Python, there is no need to declare data types for variables explicitly; a variable can be used directly wherever it is needed. This saves time and speeds up development. Python determines a variable's type at runtime from the object it refers to.
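For instance, a short sketch of dynamic typing in action:

data = 42
print(type(data))   # <class 'int'>
# The same variable can later hold a value of a different type;
# Python infers the type from the object at runtime.
data = "iPhone 12"
print(type(data))   # <class 'str'>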
A vast collection of libraries
Python boasts a comprehensive selection of libraries, including but not limited to NumPy, Matplotlib, Pandas, and SciPy, which offer versatility for a wide array of applications. This makes it an ideal choice for nearly every emerging field, as well as for web scraping tasks that extract data and perform manipulations on it.
Less Code
The primary goal of web scraping is to enhance efficiency and save time. However, what happens if you end up investing more time in coding the solution? This is where Python comes into play, as it allows users to accomplish tasks with just a few lines of code.
Open-Source Community
Python is an open-source programming language, indicating that it is accessible to all at no cost. It boasts one of the largest communities globally, where individuals can find assistance if they encounter difficulties while working with Python code.
The basics of web scraping
Web scraping is composed of two fundamental elements: a web crawler and a web scraper. To put it simply, you can think of the web crawler as a horse, while the scraper acts as the chariot. The crawler paves the way for the scraper by retrieving the necessary data. Let’s delve deeper into these two components of web scraping:
The crawler
A web crawler is commonly referred to as a "spider." It is a bot, sometimes described as an artificial intelligence technology, that browses the internet to index and locate content by following links. It seeks out the pertinent information requested by the developer.
The Scraper
A web scraper is a specialized application created to efficiently and rapidly gather data from multiple websites. The design and complexity of web scrapers can differ significantly based on the specific requirements of the projects.
How does Web Scraping work?
The subsequent steps outline how to execute web scraping. Let's delve into the mechanics of web scraping.
Step 1: Find the URL that you want to scrape
Initially, it is crucial to comprehend the data needs specific to your project. A single webpage or an entire website comprises a vast array of information. Therefore, it is essential to extract only the pertinent data. In straightforward terms, the developer must be well-acquainted with the data requirements.
Step 2: Inspecting the Page
The information is retrieved in its unprocessed HTML format, necessitating meticulous parsing to eliminate any extraneous details from the raw data. At times, the data may be straightforward, consisting solely of a name and address, or it could be intricate, encompassing high-dimensional datasets related to weather patterns and stock market fluctuations.
Step 3: Write the code
To retrieve specific information, present pertinent data, and execute the code, please follow the instructions below:
- Set Up Your Environment: Ensure that you have the necessary programming environment ready for executing the code. This may include installing required libraries or frameworks.
- Code Example: Below is an illustrative code snippet that demonstrates how to extract information from a dataset, process it, and display the results.
- Execution: To run the code, save it in a Python file (for example, extract_info.py) and execute it from your terminal or command prompt using the command shown after the code.
- Review Output: After running the code, check the console output for the summary statistics and confirm that a new file named extracted_information.csv has been created in your working directory, containing the extracted data.
import pandas as pd

# Load the dataset
data = pd.read_csv('your_data_file.csv')

# Display the first few rows of the dataset
print(data.head())

# Extract the relevant columns
extracted_info = data[['column1', 'column2']]  # Replace with your specific columns

# Summarize the extracted information
summary = extracted_info.describe()
print(summary)

# When the script is run directly, save the extracted information to a CSV file
if __name__ == "__main__":
    extracted_info.to_csv('extracted_information.csv', index=False)
    print("Information has been successfully extracted and saved.")
python extract_info.py
Make sure to replace 'your_data_file.csv' with the actual path to your dataset and adjust the column names as necessary for your specific use case.
Step 4: Store the data in the file
Save that data in the necessary formats such as CSV, XML, or JSON files.
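As a minimal sketch of this step (the records list below is a hypothetical stand-in for data your scraper has already collected), the standard csv and json modules are enough to persist the data in two of these formats:

import csv
import json

# Hypothetical scraped records standing in for real scraper output
records = [
    {"name": "Phone A", "price": "Rs49999", "rating": "4.5"},
    {"name": "Phone B", "price": "Rs39999", "rating": "4.3"},
]

# Save as CSV
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "rating"])
    writer.writeheader()
    writer.writerows(records)

# Save as JSON
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)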
Getting Started with Web Scraping
Python boasts an extensive array of libraries, among which is a particularly valuable one specifically designed for web scraping. Let’s delve into the essential library that Python offers for this purpose.
Libraries for web scraping
Selenium
Selenium is a freely available automated testing framework for web applications. It is used to automate and drive browser interactions, which also makes it useful for scraping pages that render their content with JavaScript. To install this library, enter the following command in your terminal.
pip install selenium
Note: It is good to use the PyCharm IDE.
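As a minimal sketch of how Selenium can fetch a page whose content is rendered by JavaScript (this assumes Chrome and a matching driver are available on your system; the URL is only an example):

from selenium import webdriver

# Start a Chrome session (requires Chrome and a matching driver)
driver = webdriver.Chrome()

# Load the page; the URL here is only an example
driver.get("https://en.wikipedia.org/wiki/Machine_learning")

# page_source contains the HTML after JavaScript has run;
# it can be handed to BeautifulSoup for parsing
html = driver.page_source
print(len(html))

driver.quit()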
Pandas
The pandas library serves the purpose of data manipulation and analysis. It allows users to retrieve data and save it in a preferred format.
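For example, a hedged sketch of how scraped records (hypothetical values below) can be collected into a DataFrame and saved in a preferred format:

import pandas as pd

# Hypothetical scraped records
rows = [
    {"Product_Name": "Phone A", "Pricing": "Rs49999", "Ratings": 4.5},
    {"Product_Name": "Phone B", "Pricing": "Rs39999", "Ratings": 4.3},
]

df = pd.DataFrame(rows)
print(df)
df.to_csv("phones.csv", index=False)  # save the data as a CSV file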
BeautifulSoup
BeautifulSoup is a Python package utilized for extracting data from HTML and XML documents. Its primary purpose is web scraping. It operates alongside a parser to provide an intuitive way to traverse, search, and modify the parse tree. At the time of writing, the most recent release of BeautifulSoup was version 4.8.1.
Let us delve into the BeautifulSoup library comprehensively, along with the steps for installing BeautifulSoup:
To install BeautifulSoup, simply enter the following command:
pip install beautifulsoup4
Installing a parser
BeautifulSoup is compatible with the HTML parser as well as a variety of third-party Python parsers. You have the option to install any of these based on your specific requirements. Below is a compilation of the parsers available with BeautifulSoup:
| Parser | Typical usage |
|---|---|
| Python's html.parser | BeautifulSoup(markup,"html.parser") |
| lxml's HTML parser | BeautifulSoup(markup,"lxml") |
| lxml's XML parser | BeautifulSoup(markup,"lxml-xml") |
| html5lib | BeautifulSoup(markup,"html5lib") |
We recommend installing the html5lib parser, as it is well suited to newer versions of Python; alternatively, you can install the lxml parser.
Type the following command in your terminal:
pip install html5lib
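Different parsers can build different trees from the same markup, especially when it is invalid; the short sketch below follows the behavior described in the BeautifulSoup documentation:

from bs4 import BeautifulSoup

# html.parser keeps the invalid fragment minimal
print(BeautifulSoup("<a></p>", "html.parser"))
# <a></a>

# html5lib repairs it the way a web browser would,
# adding the html, head, and body elements
print(BeautifulSoup("<a></p>", "html5lib"))
# <html><head></head><body><a><p></p></a></body></html>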
BeautifulSoup Objects: Tag, Attributes, and NavigableString
BeautifulSoup is utilized to convert intricate HTML documents into a structured hierarchy of Python objects. However, there are several fundamental categories of objects that are predominantly employed:
- Tag
A Tag object corresponds to an XML or HTML tag in the original document.
soup = bs4.BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
Output:
<class "bs4.element.Tag">
A tag is composed of numerous attributes and methods; however, the most crucial aspects of a tag are its name and its attributes.
- Name
Every tag has a name, accessible as .name:
tag.name
- Attributes
A tag can possess multiple attributes. The tag <b id = "boldest"> includes an attribute named "id" with the value set to "boldest". We are able to retrieve a tag's attributes by handling the tag as if it were a dictionary.
tag['id']
Attributes of a tag can be added, deleted, or altered. This can be accomplished by treating the tag as if it were a dictionary.
# add attributes
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag
# delete an attribute
del tag['id']
- Multi-valued Attributes
In HTML5, certain attributes are capable of holding multiple values. The class attribute, which can include multiple CSS classes, is the most frequently encountered multivalued attribute. Additional attributes that support multiple values include rel, rev, accept-charset, headers, and accesskey.
from bs4 import BeautifulSoup
class_is_multi = {'*': 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
xml_soup.p['class']
# [u'body', u'strikeout']
- NavigableString
In BeautifulSoup, a string pertains to the textual content found within a tag. The library employs the NavigableString class to encapsulate these segments of text.
tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
A string is characterized by its immutability, indicating that it cannot be altered directly. However, you can substitute it with a different string by utilizing the replace_with method.
tag.string.replace_with("No longer bold")
tag
In certain situations, when you need to use a NavigableString outside of BeautifulSoup, calling str() on it converts it into an ordinary Python string (in Python 2, the unicode() function served this purpose).
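Continuing the tag example above, the conversion is a one-liner:

# Convert the NavigableString into a plain Python string
plain_text = str(tag.string)
type(plain_text)
# <class 'str'>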
- BeautifulSoup object
The BeautifulSoup instance encapsulates the entire parsed document in its entirety. Frequently, it can be utilized similarly to a Tag object. This indicates that it is compatible with the majority of the methods outlined for traversing and querying the tree structure.
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer=BeautifulSoup("<footer>Here's the footer</footer>","xml")
doc.find(text="INSERT FOOTER HERE").replace_with(footer)
print(doc)
Output:
<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>
Web Scraping Example
To illustrate the concept of web scraping in a practical manner, we will extract data from a specific webpage and conduct a thorough inspection of the entire page.
To begin, navigate to your preferred Wikipedia entry and conduct a thorough inspection of the entire page. Prior to extracting any data from the webpage, it is crucial to confirm that your needs are adequately addressed. Take a look at the following code:
#importing the BeautifulSoup Library
import bs4
import requests
#Creating the requests
res = requests.get("https://en.wikipedia.org/wiki/Machine_learning")
print("The object type:",type(res))
# Convert the request object to the Beautiful Soup Object
soup = bs4.BeautifulSoup(res.text,'html5lib')
print("The object type:",type(soup)
Output:
The object type: <class 'requests.models.Response'>
Convert the object into: <class 'bs4.BeautifulSoup'>
In the following lines of code, we are extracting all headings of a webpage by class name. Here, front-end knowledge plays an essential role in inspecting the webpage.
soup.select('.mw-headline')
for i in soup.select('.mw-headline'):
    print(i.text, end=',')
Output:
Overview,Machine learning tasks,History and relationships to other fields,Relation to data mining,Relation to optimization,Relation to statistics, Theory,Approaches,Types of learning algorithms,Supervised learning,Unsupervised learning,Reinforcement learning,Self-learning,Feature learning,Sparse dictionary learning,Anomaly detection,Association rules,Models,Artificial neural networks,Decision trees,Support vector machines,Regression analysis,Bayesian networks,Genetic algorithms,Training models,Federated learning,Applications,Limitations,Bias,Model assessments,Ethics,Software,Free and open-source software,Proprietary software with free and open-source editions,Proprietary software,Journals,Conferences,See also,References,Further reading,External links,
Explanation:
In the code presented above, we have included the Beautiful Soup 4 (bs4) and requests libraries. On the third line, we instantiated a res object to dispatch a request to the specified webpage. As you can see, we successfully retrieved all the headings present on the webpage.
(Image: webpage of the Wikipedia "Machine learning" article)
Let’s explore another example; we will perform a GET request to a specified URL and construct a parse Tree object (referred to as soup) utilizing BeautifulSoup along with Python's built-in "html5lib" parser.
In this section, we will extract data from the webpage provided in the specified link. Please take a look at the code below:
# importing the libraries
from bs4 import BeautifulSoup
import requests
url=""
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "html5lib")
print(soup.prettify()) # print the parsed data of html
The code above will print the entire HTML markup of whatever page the url variable points to; in this example, that was the homepage of a C# tutorial site.
By utilizing the BeautifulSoup object, referred to as soup, we can extract the necessary data table. Now, let’s display some intriguing details using the soup object:
Let's print the title of the web page.
print(soup.title)
Output:
<title>Tutorials List</title>
In the aforementioned output, the title is presented alongside the HTML tag. If you prefer to display the text without any tags, you can utilize the following code:
print(soup.title.text)
Output:
Tutorials List
We can retrieve all the hyperlinks on the webpage, along with their attributes, such as href, title, and inner text. Examine the following code:
for link in soup.find_all("a"):
    print("Inner Text is: {}".format(link.text))
    print("Title is: {}".format(link.get("title")))
    print("href is: {}".format(link.get("href")))
Output:
href is: https://www.facebook.com/C# Tutorial
Inner Text is:
Title is: None
href is: https://twitter.com/pageC# TutorialTech
Inner Text is:
Title is: None
href is: https://www.youtube.com/channel/UCUnYvQVCrJoFWZhKK3O2xLg
Inner Text is:
Title is: None
href is: https://example.blogspot.com
Inner Text is: Learn Python
Title is: None
href is: python-tutorial
Inner Text is: Learn Data Structures
Title is: None
href is: data-structure-tutorial
Inner Text is: Learn C Programming
Title is: None
href is: c-programming-language-tutorial
Inner Text is: Learn C++ Tutorial
Demo: Scraping Data from Flipkart Website
In this demonstration, we will extract information regarding mobile phone prices, ratings, and model names from Flipkart, a well-known e-commerce platform. To successfully complete this task, the following prerequisites must be met:
Prerequisites:
- Python 2.x or Python 3.x with Selenium, BeautifulSoup, and Pandas libraries installed.
- Google Chrome browser
- A parser, such as html.parser, lxml, or html5lib
Step 1: Find the desired URL to scrape
The first step involves identifying the URL that you wish to scrape. In this instance, we will be gathering information about mobile phones from Flipkart. The URL for this specific page is https://www.flipkart.com/search?q=iphones&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off
Step 2: Inspecting the page
It is essential to thoroughly examine the webpage since the information is typically embedded within specific tags. Therefore, we must conduct an inspection to identify and select the appropriate tag. To initiate the inspection, simply right-click on the element and choose "inspect" from the context menu.
Step 3: Find the data for extracting
Retrieve the Price, Name, and Rating, which can be found within the "div" tag, in that specific order.
Step 4: Write the Code
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

# Request the webpage
myurl = "https://www.flipkart.com/search?q=iphones&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off"
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, features="html.parser")

# containers holds the div for each product card on the page
# (Flipkart's obfuscated class names change over time; inspect the page and update them as needed)
containers = page_soup.find_all("div", {"class": "_3O0U0u"})

# The following commented-out lines were used for testing on a single product card;
# uncomment them to inspect one container at a time:
# container = containers[0]
# print(container.prettify())
# price = container.find_all("div", {"class": "col col-5-12 _2o7WAb"})
# print(price[0].text)
# ratings = container.find_all("div", {"class": "niH0FQ"})
# print(ratings[0].text)
# print(len(containers))
# print(container.div.img["alt"])

# Creating a CSV file that will store all the data
filename = "product1.csv"
f = open(filename, "w")
headers = "Product_Name,Pricing,Ratings\n"
f.write(headers)

for container in containers:
    product_name = container.div.img["alt"]
    price_container = container.find_all("div", {"class": "col col-5-12 _2o7WAb"})
    price = price_container[0].text.strip()
    rating_container = container.find_all("div", {"class": "niH0FQ"})
    ratings = rating_container[0].text
    # Clean the price: remove thousands separators, replace the rupee
    # symbol with "Rs", and drop the trailing EMI text
    edit_price = ''.join(price.split(','))
    sym_rupee = edit_price.split("₹")
    add_rs_price = "Rs" + sym_rupee[1]
    split_price = add_rs_price.split("E")
    final_price = split_price[0]
    # Keep only the numeric part of the rating (the text before the first space)
    split_rating = str(ratings).split(" ")
    final_rating = split_rating[0]
    print(product_name.replace(",", "|") + "," + final_price + "," + final_rating + "\n")
    f.write(product_name.replace(",", "|") + "," + final_price + "," + final_rating + "\n")
f.close()
Output:
We extracted the details of the iPhones and stored them in a CSV file, as shown in the output. Some lines in the preceding code are commented out for testing; feel free to uncomment them and examine the resulting output.
Conclusion
In this tutorial, we have explored the topic of web scraping, covering everything from fundamental concepts to practical examples. We provided a demonstration of scraping data from the prominent online retail platform, Flipkart. The legality surrounding web scraping was examined to ensure compliance with relevant regulations. We delved into various applications of web scraping, including Dynamic Price Monitoring, Social Media Data Extraction, Email Scraping, News Aggregation, Content Surveillance, and Research and Development. Additionally, we investigated the significance of utilizing Web Scraping in Python, highlighting features such as its Dynamically Typed nature, an Extensive Array of Libraries, Reduced Code Complexity, and overall efficiency.
Web Scraping Using Python FAQs
1. What is web scraping?
Web Scraping is a method utilized to gather extensive data from multiple websites. The phrase "scraping" pertains to the process of retrieving information from alternate sources (web pages) and storing it in a local file.
2. Which Python libraries are commonly used for web scraping?
There are several Python libraries that are commonly used for Web Scraping, such as:
- Requests : It is used for sending HTTP requests.
- BeautifulSoup : It is used for parsing HTML/XML.
- lxml : It is a fast HTML/XML parser.
- Selenium : It is used for scraping dynamic sites.
- Scrapy : It is a full-fledged web scraping framework.
3. How do you fetch a webpage using Python?
Let's see how we fetch a webpage using Python:
import requests
url = "https://logic-practice.com"
response = requests.get(url)
print(response.text)
4. How to extract all links from a webpage?
Let’s explore the method for retrieving all hyperlinks from a webpage utilizing Python:
import requests
from bs4 import BeautifulSoup
url = "https://logic-practice.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.find_all("a"):
    print(link.get("href"))
5. Is web scraping legal?
Web Scraping is typically permissible under legal frameworks; however, its legality can vary from one website to another based on their specific terms and conditions. It is crucial to refrain from scraping websites that explicitly prohibit such activities. Additionally, we should adhere to responsible practices, including implementing rate limiting and rotating user agents.
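As a hedged sketch of those practices (the URL and User-Agent string below are placeholders), a delay between requests and a custom User-Agent header can be added with requests:

import time
import requests

url = "https://logic-practice.com"  # placeholder URL

# Identify the client with a custom User-Agent header (placeholder value)
headers = {"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"}

for page in range(1, 4):
    response = requests.get(url, headers=headers, params={"page": page})
    print(page, response.status_code)
    time.sleep(2)  # rate limiting: pause between requests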