Web Scraping is a method employed to gather substantial amounts of data from various websites. The phrase "scraping" signifies the process of retrieving information from external sources (such as web pages) and storing it in a local file. For instance, imagine you are developing a project named "Mobile Phone Comparison Website," where you need to collect the prices, ratings, and model names of mobile phones to facilitate comparisons among different devices.
Gathering this information by manually visiting multiple websites can be quite time-consuming. In such scenarios, web scraping becomes invaluable, as it allows you to obtain the required data by simply writing a few lines of code.
Web scraping involves the retrieval of data from websites that is presented in an unstructured format. This process facilitates the gathering of such unstructured information and transforms it into a structured format.
Startups favor web scraping because it is an economical and efficient way to acquire large volumes of data without having to partner with the companies that sell it.
Is Web Scraping legal?
In this context, a pertinent question emerges regarding the legality of web scraping. The response is that certain websites permit it when conducted within legal parameters. Web scraping serves merely as a mechanism; it can be utilized appropriately or improperly, depending on the circumstances.
Extracting data from the web can be considered unlawful when it involves scraping nonpublic information. Nonpublic data refers to information that is not available to the general public; attempting to retrieve such data constitutes a breach of legal regulations.
There are several tools available to scrape data from websites, such as:
- Scraping-bot
- Scraper API
- Octoparse
- Import.io
- Webhose.io
- Dexi.io
- Outwit
- Diffbot
- Content Grabber
- Mozenda
- Web Scraper Chrome Extension
Why Web Scraping?
As previously mentioned, web scraping serves the purpose of retrieving data from online sources. However, it is essential to understand how to effectively utilize that unprocessed data. This raw information can be applied across a multitude of domains. Let’s explore the applications of web scraping:
Dynamic Price Monitoring
It is commonly employed to gather data from e-commerce platforms so that product prices can be compared and advantageous pricing strategies formulated. Using web-scraped data for price monitoring keeps businesses informed about market trends and supports dynamic pricing, helping companies maintain a competitive edge over their rivals.
Market Research
Web Scraping serves as an excellent tool for analyzing market trends. It allows for the extraction of valuable insights related to a specific market. Large enterprises often require extensive data sets, and web scraping ensures that this data is collected with a high degree of reliability and precision.
Email Gathering
Numerous organizations leverage personal email information for the purposes of email marketing. This enables them to focus on particular segments of their audience for their promotional efforts.
News and Content Monitoring
A single news cycle can create a significant opportunity for your business or pose a real threat to it. If your organization depends on timely news analysis, or frequently appears in the news itself, web scraping is an ideal way to monitor and analyze the most important stories. News reports and social media posts can directly affect market movements.
Social Media Scraping
Web scraping serves a crucial function in the process of gathering data from social media platforms like Twitter, Facebook, and Instagram, enabling users to identify current trending topics.
Research and Development
An extensive array of data, encompassing general information, statistical figures, and temperature readings, is extracted from websites. This data is then examined and utilized for conducting surveys or for research and development purposes.
Why use Python for Web Scraping?
While there are numerous well-known programming languages available, what are the reasons for selecting Python over others for web scraping tasks? In the following section, we will outline a series of features that position Python as the most advantageous programming language for web scraping activities.
Dynamically Typed
In Python, there is no need to declare data types for variables explicitly; a variable can be used directly wherever it is needed. This saves time and speeds up development. Python determines a variable's type at runtime from the object it refers to.
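For instance, a short sketch of dynamic typing in action:

data = 42
print(type(data))   # <class 'int'>
# The same variable can later hold a value of a different type;
# Python infers the type from the object at runtime.
data = "iPhone 12"
print(type(data))   # <class 'str'>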
A vast collection of libraries
Python boasts a comprehensive selection of libraries, including but not limited to NumPy, Matplotlib, Pandas, and SciPy, which offer versatility for a wide array of applications. This makes it an ideal choice for nearly every emerging field, as well as for web scraping tasks that extract data and perform manipulations on it.
Less Code
The primary goal of web scraping is to enhance efficiency and save time. However, what happens if you end up investing more time in coding the solution? This is where Python comes into play, as it allows users to accomplish tasks with just a few lines of code.
Open-Source Community
Python is an open-source programming language, indicating that it is accessible to all at no cost. It boasts one of the largest communities globally, where individuals can find assistance if they encounter difficulties while working with Python code.
The basics of web scraping
Web scraping is composed of two fundamental elements: a web crawler and a web scraper. To put it simply, you can think of the web crawler as a horse, while the scraper acts as the chariot. The crawler paves the way for the scraper by retrieving the necessary data. Let’s delve deeper into these two components of web scraping:
The crawler
A web crawler is commonly referred to as a "spider." It is a bot, sometimes described as an artificial intelligence technology, that browses the internet to index and locate content by following links. It seeks out the pertinent information requested by the developer.
The Scraper
A web scraper is a specialized application created to efficiently and rapidly gather data from multiple websites. The design and complexity of web scrapers can differ significantly based on the specific requirements of the projects.
How does Web Scraping work?
The subsequent steps outline how to execute web scraping. Let's delve into the mechanics of web scraping.
Step 1: Find the URL that you want to scrape
Initially, it is crucial to comprehend the data needs specific to your project. A single webpage or an entire website comprises a vast array of information. Therefore, it is essential to extract only the pertinent data. In straightforward terms, the developer must be well-acquainted with the data requirements.
Step 2: Inspecting the Page
The information is retrieved in its unprocessed HTML format, necessitating meticulous parsing to eliminate any extraneous details from the raw data. At times, the data may be straightforward, consisting solely of a name and address, or it could be intricate, encompassing high-dimensional datasets related to weather patterns and stock market fluctuations.
Step 3: Write the code
To retrieve specific information, present pertinent data, and execute the code, please follow the instructions below:
- Set Up Your Environment: Ensure that you have the necessary programming environment ready for executing the code. This may include installing required libraries or frameworks.
- Code Example: Below is an illustrative code snippet that demonstrates how to extract information from a dataset, process it, and display the results.
- Execution: To run the code, save it in a Python file (for example, extract_info.py) and execute it from your terminal or command prompt using the command shown after the code.
- Review Output: After running the code, check the console output for the summary statistics and confirm that a new file named extracted_information.csv has been created in your working directory, containing the extracted data.
import pandas as pd

# Load the dataset
data = pd.read_csv('your_data_file.csv')

# Display the first few rows of the dataset
print(data.head())

# Extract the relevant columns
extracted_info = data[['column1', 'column2']]  # Replace with your specific columns

# Summarize the extracted information
summary = extracted_info.describe()
print(summary)

# When the script is run directly, save the extracted information to a CSV file
if __name__ == "__main__":
    extracted_info.to_csv('extracted_information.csv', index=False)
    print("Information has been successfully extracted and saved.")
python extract_info.py
Make sure to replace 'your_data_file.csv' with the actual path to your dataset and adjust the column names as necessary for your specific use case.
Step 4: Store the data in the file
Save that data in the necessary formats such as CSV, XML, or JSON files.
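As a minimal sketch of this step (the records list below is a hypothetical stand-in for data your scraper has already collected), the standard csv and json modules are enough to persist the data in two of these formats:

import csv
import json

# Hypothetical scraped records standing in for real scraper output
records = [
    {"name": "Phone A", "price": "Rs49999", "rating": "4.5"},
    {"name": "Phone B", "price": "Rs39999", "rating": "4.3"},
]

# Save as CSV
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "rating"])
    writer.writeheader()
    writer.writerows(records)

# Save as JSON
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)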
Getting Started with Web Scraping
Python boasts an extensive array of libraries, among which is a particularly valuable one specifically designed for web scraping. Let’s delve into the essential library that Python offers for this purpose.
Libraries for web scraping
Selenium
Selenium is a freely available automated testing framework for web applications. It is used to automate and drive browser interactions, which also makes it useful for scraping pages that render their content with JavaScript. To install this library, enter the following command in your terminal.
pip install selenium
Note: It is good to use the PyCharm IDE.
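As a minimal sketch of how Selenium can fetch a page whose content is rendered by JavaScript (this assumes Chrome and a matching driver are available on your system; the URL is only an example):

from selenium import webdriver

# Start a Chrome session (requires Chrome and a matching driver)
driver = webdriver.Chrome()

# Load the page; the URL here is only an example
driver.get("https://en.wikipedia.org/wiki/Machine_learning")

# page_source contains the HTML after JavaScript has run;
# it can be handed to BeautifulSoup for parsing
html = driver.page_source
print(len(html))

driver.quit()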
Pandas
The pandas library serves the purpose of data manipulation and analysis. It allows users to retrieve data and save it in a preferred format.
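For example, a hedged sketch of how scraped records (hypothetical values below) can be collected into a DataFrame and saved in a preferred format:

import pandas as pd

# Hypothetical scraped records
rows = [
    {"Product_Name": "Phone A", "Pricing": "Rs49999", "Ratings": 4.5},
    {"Product_Name": "Phone B", "Pricing": "Rs39999", "Ratings": 4.3},
]

df = pd.DataFrame(rows)
print(df)
df.to_csv("phones.csv", index=False)  # save the data as a CSV file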
BeautifulSoup
BeautifulSoup is a Python package utilized for extracting data from HTML and XML documents. Its primary purpose is web scraping. It operates alongside a parser to provide an intuitive way to traverse, search, and modify the parse tree. At the time of writing, the most recent release of BeautifulSoup was version 4.8.1.
Let us delve into the BeautifulSoup library comprehensively, along with the steps for installing BeautifulSoup:
To install BeautifulSoup, simply enter the following command:
pip install beautifulsoup4
Installing a parser
BeautifulSoup is compatible with the HTML parser as well as a variety of third-party Python parsers. You have the option to install any of these based on your specific requirements. Below is a compilation of the parsers available with BeautifulSoup:
| Parser | Typical usage |
|---|---|
| Python's html.parser | BeautifulSoup(markup,"html.parser") |
| lxml's HTML parser | BeautifulSoup(markup,"lxml") |
| lxml's XML parser | BeautifulSoup(markup,"lxml-xml") |
| html5lib | BeautifulSoup(markup,"html5lib") |
We recommend installing the html5lib parser, as it is well suited to newer versions of Python; alternatively, you can install the lxml parser.
Type the following command in your terminal:
pip install html5lib
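Different parsers can build different trees from the same markup, especially when it is invalid; the short sketch below follows the behavior described in the BeautifulSoup documentation:

from bs4 import BeautifulSoup

# html.parser keeps the invalid fragment minimal
print(BeautifulSoup("<a></p>", "html.parser"))
# <a></a>

# html5lib repairs it the way a web browser would,
# adding the html, head, and body elements
print(BeautifulSoup("<a></p>", "html5lib"))
# <html><head></head><body><a><p></p></a></body></html>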
BeautifulSoup Objects: Tag, Attributes, and NavigableString
BeautifulSoup is utilized to convert intricate HTML documents into a structured hierarchy of Python objects. However, there are several fundamental categories of objects that are predominantly employed:
- Tag
A Tag object corresponds to an XML or HTML tag in the original document.
soup = bs4.BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
Output:
<class "bs4.element.Tag">
A tag is composed of numerous attributes and methods; however, the most crucial aspects of a tag are its name and its attributes.
- Name
Every tag has a name, accessible as .name:
tag.name
- Attributes
A tag can possess multiple attributes. The tag <b id = "boldest"> includes an attribute named "id" with the value set to "boldest". We are able to retrieve a tag's attributes by handling the tag as if it were a dictionary.
tag['id']
Attributes of a tag can be added, deleted, or altered. This can be accomplished by treating the tag as if it were a dictionary.
# add attributes
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag
# delete an attribute
del tag['id']
- Multi-valued Attributes
In HTML5, certain attributes are capable of holding multiple values. The class attribute, which can include multiple CSS classes, is the most frequently encountered multivalued attribute. Additional attributes that support multiple values include rel, rev, accept-charset, headers, and accesskey.
from bs4 import BeautifulSoup
class_is_multi = {'*': 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
xml_soup.p['class']
# [u'body', u'strikeout']
- NavigableString
In BeautifulSoup, a string pertains to the textual content found within a tag. The library employs the NavigableString class to encapsulate these segments of text.
tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
A string is characterized by its immutability, indicating that it cannot be altered directly. However, you can substitute it with a different string by utilizing the replace_with method.
tag.string.replace_with("No longer bold")
tag
In certain situations, when you need to use a NavigableString outside of BeautifulSoup, calling str() on it converts it into an ordinary Python string (in Python 2, the unicode() function served this purpose).
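Continuing the tag example above, the conversion is a one-liner:

# Convert the NavigableString into a plain Python string
plain_text = str(tag.string)
type(plain_text)
# <class 'str'>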
- BeautifulSoup object
The BeautifulSoup instance encapsulates the entire parsed document in its entirety. Frequently, it can be utilized similarly to a Tag object. This indicates that it is compatible with the majority of the methods outlined for traversing and querying the tree structure.
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer=BeautifulSoup("<footer>Here's the footer</footer>","xml")
doc.find(text="INSERT FOOTER HERE").replace_with(footer)
print(doc)
Output:
<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>
Web Scraping Example
To illustrate the concept of web scraping in a practical manner, we will extract data from a specific webpage and conduct a thorough inspection of the entire page.
To begin, navigate to your preferred Wikipedia entry and conduct a thorough inspection of the entire page. Prior to extracting any data from the webpage, it is crucial to confirm that your needs are adequately addressed. Take a look at the following code:
#importing the BeautifulSoup Library
import bs4
import requests
#Creating the requests
res = requests.get("https://en.wikipedia.org/wiki/Machine_learning")
print("The object type:",type(res))
# Convert the request object to the Beautiful Soup Object
soup = bs4.BeautifulSoup(res.text,'html5lib')
print("The object type:",type(soup)
Output:
The object type: <class 'requests.models.Response'>
Convert the object into: <class 'bs4.BeautifulSoup'>
In the following lines of code, we are extracting all headings of a webpage by class name. Here, front-end knowledge plays an essential role in inspecting the webpage.
soup.select('.mw-headline')
for i in soup.select('.mw-headline'):
    print(i.text, end=',')
Output:
Overview,Machine learning tasks,History and relationships to other fields,Relation to data mining,Relation to optimization,Relation to statistics, Theory,Approaches,Types of learning algorithms,Supervised learning,Unsupervised learning,Reinforcement learning,Self-learning,Feature learning,Sparse dictionary learning,Anomaly detection,Association rules,Models,Artificial neural networks,Decision trees,Support vector machines,Regression analysis,Bayesian networks,Genetic algorithms,Training models,Federated learning,Applications,Limitations,Bias,Model assessments,Ethics,Software,Free and open-source software,Proprietary software with free and open-source editions,Proprietary software,Journals,Conferences,See also,References,Further reading,External links,
Explanation:
In the code presented above, we have included the Beautiful Soup 4 (bs4) and requests libraries. On the third line, we instantiated a res object to dispatch a request to the specified webpage. As you can see, we successfully retrieved all the headings present on the webpage.
(Image: webpage of the Wikipedia "Machine learning" article)
Let’s explore another example; we will perform a GET request to a specified URL and construct a parse Tree object (referred to as soup) utilizing BeautifulSoup along with Python's built-in "html5lib" parser.
In this section, we will extract data from the webpage provided in the specified link. Please take a look at the code below:
# importing the libraries
from bs4 import BeautifulSoup
import requests
url=""
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "html5lib")
print(soup.prettify()) # print the parsed data of html
The code above will print the entire HTML markup of whatever page the url variable points to; in this example, that was the homepage of a C# tutorial site.
By utilizing the BeautifulSoup object, referred to as soup, we can extract the necessary data table. Now, let’s display some intriguing details using the soup object:
Let's print the title of the web page.
print(soup.title)
Output:
<title>Tutorials List</title>
In the aforementioned output, the title is presented alongside the HTML tag. If you prefer to display the text without any tags, you can utilize the following code:
print(soup.title.text)
Output:
Tutorials List
We can retrieve all the hyperlinks on the webpage, along with their attributes, such as href, title, and inner text. Examine the following code:
for link in soup.find_all("a"):
    print("Inner Text is: {}".format(link.text))
    print("Title is: {}".format(link.get("title")))
    print("href is: {}".format(link.get("href")))
Output:
href is: https://www.facebook.com/C# Tutorial
Inner Text is:
Title is: None
href is: https://twitter.com/pageC# TutorialTech
Inner Text is:
Title is: None
href is: https://www.youtube.com/channel/UCUnYvQVCrJoFWZhKK3O2xLg
Inner Text is:
Title is: None
href is: https://example.blogspot.com
Inner Text is: Learn Python
Title is: None
href is: python-tutorial
Inner Text is: Learn Data Structures
Title is: None
href is: data-structure-tutorial
Inner Text is: Learn C Programming
Title is: None
href is: c-programming-language-tutorial
Inner Text is: Learn C++ Tutorial
Demo: Scraping Data from Flipkart Website
In this demonstration, we will extract information regarding mobile phone prices, ratings, and model names from Flipkart, a well-known e-commerce platform. To successfully complete this task, the following prerequisites must be met:
Prerequisites:
- Python 2.x or Python 3.x with Selenium, BeautifulSoup, and Pandas libraries installed.
- Google Chrome browser
- A parser, such as html.parser, lxml, or html5lib
Step 1: Find the desired URL to scrape
The first step involves identifying the URL that you wish to scrape. In this instance, we will be gathering information about mobile phones from Flipkart. The URL for this specific page is https://www.flipkart.com/search?q=iphones&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off
Step 2: Inspecting the page
It is essential to thoroughly examine the webpage since the information is typically embedded within specific tags. Therefore, we must conduct an inspection to identify and select the appropriate tag. To initiate the inspection, simply right-click on the element and choose "inspect" from the context menu.
Step 3: Find the data for extracting
Retrieve the Price, Name, and Rating, which can be found within the "div" tag, in that specific order.
Step 4: Write the Code
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

# Request the webpage
myurl = "https://www.flipkart.com/search?q=iphones&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off"
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, features="html.parser")

# containers holds the div for each product card on the page
# (Flipkart's obfuscated class names change over time; inspect the page and update them as needed)
containers = page_soup.find_all("div", {"class": "_3O0U0u"})

# The following commented-out lines were used for testing on a single product card;
# uncomment them to inspect one container at a time:
# container = containers[0]
# print(container.prettify())
# price = container.find_all("div", {"class": "col col-5-12 _2o7WAb"})
# print(price[0].text)
# ratings = container.find_all("div", {"class": "niH0FQ"})
# print(ratings[0].text)
# print(len(containers))
# print(container.div.img["alt"])

# Creating a CSV file that will store all the data
filename = "product1.csv"
f = open(filename, "w")
headers = "Product_Name,Pricing,Ratings\n"
f.write(headers)

for container in containers:
    product_name = container.div.img["alt"]
    price_container = container.find_all("div", {"class": "col col-5-12 _2o7WAb"})
    price = price_container[0].text.strip()
    rating_container = container.find_all("div", {"class": "niH0FQ"})
    ratings = rating_container[0].text
    # Clean the price: remove thousands separators, replace the rupee
    # symbol with "Rs", and drop the trailing EMI text
    edit_price = ''.join(price.split(','))
    sym_rupee = edit_price.split("₹")
    add_rs_price = "Rs" + sym_rupee[1]
    split_price = add_rs_price.split("E")
    final_price = split_price[0]
    # Keep only the numeric part of the rating (the text before the first space)
    split_rating = str(ratings).split(" ")
    final_rating = split_rating[0]
    print(product_name.replace(",", "|") + "," + final_price + "," + final_rating + "\n")
    f.write(product_name.replace(",", "|") + "," + final_price + "," + final_rating + "\n")
f.close()
Output:
We extracted the details of the iPhones and stored them in a CSV file, as shown in the output. Some lines in the preceding code are commented out for testing; feel free to uncomment them and examine the resulting output.
Conclusion
In this tutorial, we have explored the topic of web scraping, covering everything from fundamental concepts to practical examples. We provided a demonstration of scraping data from the prominent online retail platform, Flipkart. The legality surrounding web scraping was examined to ensure compliance with relevant regulations. We delved into various applications of web scraping, including Dynamic Price Monitoring, Social Media Data Extraction, Email Scraping, News Aggregation, Content Surveillance, and Research and Development. Additionally, we investigated the significance of utilizing Web Scraping in Python, highlighting features such as its Dynamically Typed nature, an Extensive Array of Libraries, Reduced Code Complexity, and overall efficiency.
Web Scraping Using Python FAQs
1. What is web scraping?
Web Scraping is a method utilized to gather extensive data from multiple websites. The phrase "scraping" pertains to the process of retrieving information from alternate sources (web pages) and storing it in a local file.
2. Which Python libraries are commonly used for web scraping?
There are several Python libraries that are commonly used for Web Scraping, such as:
- Requests : It is used for sending HTTP requests.
- BeautifulSoup : It is used for parsing HTML/XML.
- lxml : It is a fast HTML/XML parser.
- Selenium : It is used for scraping dynamic sites.
- Scrapy : It is a full-fledged web scraping framework.
3. How do you fetch a webpage using Python?
Let's see how we fetch a webpage using Python:
import requests
url = "https://logic-practice.com"
response = requests.get(url)
print(response.text)
4. How to extract all links from a webpage?
Let’s explore the method for retrieving all hyperlinks from a webpage utilizing Python:
import requests
from bs4 import BeautifulSoup
url = "https://logic-practice.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.find_all("a"):
    print(link.get("href"))
5. Is web scraping legal?
Web Scraping is typically permissible under legal frameworks; however, its legality can vary from one website to another based on their specific terms and conditions. It is crucial to refrain from scraping websites that explicitly prohibit such activities. Additionally, we should adhere to responsible practices, including implementing rate limiting and rotating user agents.
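As a hedged sketch of those practices (the URL and User-Agent string below are placeholders), a delay between requests and a custom User-Agent header can be added with requests:

import time
import requests

url = "https://logic-practice.com"  # placeholder URL

# Identify the client with a custom User-Agent header (placeholder value)
headers = {"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"}

for page in range(1, 4):
    response = requests.get(url, headers=headers, params={"page": page})
    print(page, response.status_code)
    time.sleep(2)  # rate limiting: pause between requests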