Web Scraping with Beautiful Soup

Introduction:

Web scraping is the process of extracting data from websites and parsing it. In other words, it's a technique for extracting unstructured data and storing that data either in a local file or in a database. Collecting data by hand involves a huge amount of work and consumes a lot of time; web scraping can save programmers many hours.

The basic steps involved in web scraping are:

  • Loading the document (HTML content)

  • Parsing the document

  • Extraction

  • Transformation

Beautiful Soup:

Beautiful Soup is a Python web scraping library that allows us to parse and scrape HTML and XML pages. You can search, navigate, and modify the parse tree using a parser of your choice. It's versatile and saves a lot of time. In this article, we will learn how to scrape data using Beautiful Soup.

Step 1: Installation

Beautiful Soup can be installed using the pip command. You can also try pip3 if pip is not working.

pip install requests

pip install beautifulsoup4

The requests module is used to fetch the HTML content of a page over HTTP.
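To confirm the installation, a quick sanity check is to import both packages and print their versions (the exact version numbers will vary):

#both imports should succeed without errors
import requests
import bs4
print(requests.__version__)
print(bs4.__version__)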

Step 2: Inspect the Source

The next step is to inspect the website that you want to scrape. Start by opening the site in a browser. Go through the structure of the site and find the part of the webpage you want to scrape. Then open the browser's developer tools (for example, More tools > Developer tools in Chrome) and switch to the Elements tab.


Step 3: Get the HTML Content

Next, get the HTML content from a web page. We use the requests module for this task. We call the get() function, passing the URL of the webpage as an argument, as shown below:

#import requests library
import requests

#the website URL
url_link = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"

result = requests.get(url_link).text
print(result)

In the above code, we issue an HTTP GET request to the specified URL and store the HTML returned by the server in a Python string. The .text attribute gives us the response body as text, which print() then displays.
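In practice, it is also worth checking that the request succeeded before parsing. A minimal sketch of a more defensive version (the 10-second timeout is an arbitrary choice):

import requests

url_link = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"
response = requests.get(url_link, timeout=10)
response.raise_for_status() #raises an HTTPError for 4xx/5xx responses
result = response.text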
Step 4: Parsing an HTML Page with Beautiful Soup
Now that we have the HTML content, the next step is to parse and process the data. To do so, we import the library, create an instance of the BeautifulSoup class, and process the data.

from bs4 import BeautifulSoup
#import requests library
import requests

#the website URL
url_link = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"

result = requests.get(url_link).text
doc = BeautifulSoup(result, "html.parser")
print(doc.prettify())

The prettify() function prints the HTML content in a nested, indented form that is easy to read and helps identify the tags we need.
There are two methods to find tags: find() and find_all().
find(): This method returns the first matching element.
find_all(): This method returns all matching elements.
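A small illustration of the difference, assuming doc is the BeautifulSoup object created above:

#find() returns a single Tag (or None if nothing matches)
first_link = doc.find("a")
#find_all() returns a list of every matching Tag
all_links = doc.find_all("a")
print(first_link)
print(len(all_links))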

Find Elements by ID:

In HTML, the id attribute, when present, uniquely identifies an element on the page. Let us now try to find an element by the value of its ID attribute. For example, I am looking for the element whose ID attribute has the value "content", as shown below:


res = doc.find(id="content")
print(res)


Find Elements by Class Name:

To find an element by class name within the above res, we can extract the h1 element with the class name "firstHeading":
<h1 class="firstHeading" id="firstHeading">List of states and territories of the United States</h1>

heading = res.find(class_="firstHeading")
print(heading)

<h1 class="firstHeading" id="firstHeading">List of states and territories of the United States</h1>

Extracting Text From HTML Elements

If we want only the text from the above heading tag, we can get it as follows:

print(heading.text)

List of states and territories of the United States
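The get_text() method is equivalent and accepts extra options; for example, strip=True trims surrounding whitespace:

#same result, with leading and trailing whitespace removed
print(heading.get_text(strip=True))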

Accessing the Nested Tags:


For example, we will try to find the h2 tags in the <main> element with id="content".

res = doc.find(id="content")
#iterating over a Tag yields its direct children,
#so this prints the first <h2> inside res once per child
for ele in res:
    print(res.find("h2"))

<h2 id="mw-toc-heading">Contents</h2>
<h2 id="mw-toc-heading">Contents</h2>
<h2 id="mw-toc-heading">Contents</h2>
(the same line is printed 11 times in total, once per child of res)

Searching Using a String (text):

For example, if you want to search for the text "California", you can do it by using the code below.

res = doc.find_all(text="California")
print(res)

['California', 'California', 'California']

Search by Passing a List

You can also pass a list to the find_all() function and Beautiful Soup will find all the elements that match any item in that list.
For example, the below code will find all the <a>, <p> and <div> tags in the document.

res = doc.find_all(["a", "p", "div"])

Search Using a Regular Expression

If you pass in a regular expression, Beautiful Soup will filter using it. We have to import the re module, as shown below.

import re

#use a name other than the built-in str for the loop variable
for s in doc.find_all(text=re.compile("1788")):
    print(s)

Jan 9, 1788
Jan 2, 1788
Apr 28, 1788
Feb 6, 1788
Jun 21, 1788
Jul 26, 1788
May 23, 1788
Jun 25, 1788

Further, if you want only a limited number of results, you can use the limit argument.

for s in doc.find_all(text=re.compile("1788"), limit=2):
    print(s)

Jan 9, 1788
Jan 2, 1788

Search Using CSS Selectors

Beautiful Soup has a .select method that allows us to filter using a CSS selector.

print(doc.select("title"))

You can also select tags nested beneath other tags:

print(doc.select("html head title"))
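If you only want the first match, select_one() returns a single element rather than a list:

#select_one() is the CSS-selector counterpart of find()
print(doc.select_one("html head title"))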

Finding Tags by CSS Class

print(doc.select(".vector-menu-content"))

or

print(doc.select("[class~=vector-menu-content]"))

Finding Tags by ID

print(doc.select("#p-logo"))

<div id="p-logo" role="banner">
<a class="mw-wiki-logo" href="/wiki/Main_Page" title="Visit the main page"></a>
</div>

You can also combine a tag name with an ID:

print(doc.select("div#mw-panel"))

Testing if the Attribute Exists in a Tag

print(doc.select("footer[role]"))

print(doc.select("a[href]"))
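A common follow-up is to pull out the attribute values themselves; a quick sketch (printing only the first five hrefs, an arbitrary slice):

#collect the href value from every anchor that has one
links = [a["href"] for a in doc.select("a[href]")]
print(links[:5])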

A Simple Practical Exercise

In this exercise, we will take a webpage from Wikipedia (https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States) as an example. This page contains a table listing the U.S. states, their populations, and other details. We will try to get the state names and the population columns of the table.


The initial step is to identify the text or area of the webpage to be scraped. To find it, select the area of the page, right-click, and choose Inspect.
You can see that the element I am looking for is in the table with the class name "wikitable sortable plainrowheaders", and it is the string of an <a> tag nested inside a <th> tag.


Let us now write the code to fetch the data.
First, import the essential libraries.

from bs4 import BeautifulSoup
#import requests library
import requests

In the next step, we issue a GET request, passing the URL of the webpage to be parsed. We then create a Beautiful Soup object with "html.parser".

#the website url
url_link = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"
result = requests.get(url_link).text
doc = BeautifulSoup(result, "html.parser")

Then, we use the BeautifulSoup object created above and collect the required table data by using the class name:

my_table = doc.find("table", class_="wikitable sortable plainrowheaders")

We then extract all the <th> tags in our table and finally get the text inside the <a> tags.

th_tags = my_table.find_all('th')
names = []
for elem in th_tags:
    #finding the <a> tags
    a_links = elem.find_all("a")
    #getting the text inside each <a> tag
    for i in a_links:
        names.append(i.string)
print(names)

['postal abbreviation', '[13]', '[C]', '[15]', '[16]', '[16]', '[16]', None, '[17]', 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', '[D]', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', '[D]', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', '[D]', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', '[D]', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']

In the above result, you can observe that the state names start at index 9, and there are also a few '[D]' footnote markers in between. We will prepare the final list by removing the unwanted strings.

final_list = names[9:]
states = []
for s in final_list:
    #footnote markers such as '[D]' are only 3 characters long
    if len(s) > 3:
        states.append(s)
print(states)

And finally, the result goes here:

['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']

In a similar way, we will now try to scrape the population column from the same table. When I inspect the column element, I can see that it is contained inside <div> tags, as shown below:


The code goes as follows:

divs = my_table.find_all("div")
pop = []
for i in divs:
    pop.append(i.string)
print(pop)

['5,024,279', '7', '733,391', '1', '7,151,502', '9', '3,011,524', '4', '39,538,223', '53', '5,773,714', '7', '3,605,944', '5', '989,948', '1', '21,538,187', '27', '10,711,908', '14', '1,455,271', '2', '1,839,106', '2', '12,812,508', '18', '6,785,528', '9', '3,190,369', '4', '2,937,880', '4', '4,505,836', '6', '4,657,757', '6', '1,362,359', '2', '6,177,224', '8', '7,029,917', '9', '10,077,331', '14', '5,706,494', '8', '2,961,279', '4', '6,154,913', '8', '1,084,225', '1', '1,961,504', '3', '3,104,614', '4', '1,377,529', '2', '9,288,994', '12', '2,117,522', '3', '20,201,249', '27', '10,439,388', '13', '779,094', '1', '11,799,448', '16', '3,959,353', '5', '4,237,256', '5', '13,002,700', '18', '1,097,379', '2', '5,118,425', '7', '886,667', '1', '6,910,840', '9', '29,145,505', '36', '3,271,616', '4', '643,077', '1', '8,631,393', '11', '7,705,281', '10', '1,793,716', '3', '5,893,718', '8', '576,851', '1']

We will now remove the unwanted strings in between.

pop_final = []
for i in pop:
    #the unwanted strings are only 1 or 2 characters long
    if len(i) > 3:
        pop_final.append(i)
print(pop_final)

And the final result goes here:

['5,024,279', '733,391', '7,151,502', '3,011,524', '39,538,223', '5,773,714', '3,605,944', '989,948', '21,538,187', '10,711,908', '1,455,271', '1,839,106', '12,812,508', '6,785,528', '3,190,369', '2,937,880', '4,505,836', '4,657,757', '1,362,359', '6,177,224', '7,029,917', '10,077,331', '5,706,494', '2,961,279', '6,154,913', '1,084,225', '1,961,504', '3,104,614', '1,377,529', '9,288,994', '2,117,522', '20,201,249', '10,439,388', '779,094', '11,799,448', '3,959,353', '4,237,256', '13,002,700', '1,097,379', '5,118,425', '886,667', '6,910,840', '29,145,505', '3,271,616', '643,077', '8,631,393', '7,705,281', '1,793,716', '5,893,718', '576,851']

Writing Data to CSV

import pandas as pd

df = pd.DataFrame()
df['state'] = states
df['population'] = pop_final
print(df)

state population
0 Alabama 5,024,279
1 Alaska 733,391
2 Arizona 7,151,502
3 Arkansas 3,011,524
4 California 39,538,223
5 Colorado 5,773,714
6 Connecticut 3,605,944
7 Delaware 989,948
8 Florida 21,538,187
9 Georgia 10,711,908
10 Hawaii 1,455,271
11 Idaho 1,839,106
12 Illinois 12,812,508
13 Indiana 6,785,528
14 Iowa 3,190,369
15 Kansas 2,937,880
16 Kentucky 4,505,836
17 Louisiana 4,657,757
18 Maine 1,362,359
19 Maryland 6,177,224
20 Massachusetts 7,029,917
21 Michigan 10,077,331
22 Minnesota 5,706,494
23 Mississippi 2,961,279
24 Missouri 6,154,913
25 Montana 1,084,225
26 Nebraska 1,961,504
27 Nevada 3,104,614
28 New Hampshire 1,377,529
29 New Jersey 9,288,994
30 New Mexico 2,117,522
31 New York 20,201,249
32 North Carolina 10,439,388
33 North Dakota 779,094
34 Ohio 11,799,448
35 Oklahoma 3,959,353
36 Oregon 4,237,256
37 Pennsylvania 13,002,700
38 Rhode Island 1,097,379
39 South Carolina 5,118,425
40 South Dakota 886,667
41 Tennessee 6,910,840
42 Texas 29,145,505
43 Utah 3,271,616
44 Vermont 643,077
45 Virginia 8,631,393
46 Washington 7,705,281
47 West Virginia 1,793,716
48 Wisconsin 5,893,718
49 Wyoming 576,851

We then write the data frame to a CSV file using the line of code below.

df.to_csv('us_info.csv')
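By default, to_csv() also writes the DataFrame index as the first column; pass index=False if you only want the two data columns:

df.to_csv('us_info.csv', index=False)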


Conclusion:

Beautiful Soup is easy to learn and beginner-friendly. In this article, we covered the basics of web scraping using Beautiful Soup and worked through a sample project to better understand the concepts. In short, the requests library lets you fetch static HTML content from the web, and the Beautiful Soup package lets you parse that HTML with a parser of your choice. There are many more advanced, interesting concepts to explore on this topic. You can find the documentation here:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
