Understanding the HTML Basics for Web Scraping

A first step to take before scraping a website using Python

Computer on desk. Photo by Carl Heyerdahl on Unsplash.

This post is the first in a series of tutorials on building scrapers. The full series is listed below:

  1. HTML basics for web scraping (this post)
  2. Web Scraping with Octoparse
  3. Web Scraping with Selenium
  4. Web Scraping with Beautiful Soup

The purpose of this series is to learn how to extract data from websites. Most of the data on websites is in HTML format, so the first tutorial explains the basics of this markup language. The second guide shows an easy way to scrape data with an intuitive web scraping tool that doesn’t require any knowledge of HTML. The last two tutorials instead focus on gathering data from the web with Python. In that case, you interact directly with HTML pages, so you need some prior knowledge of HTML.

I discovered web scraping while working towards my master’s degree in Data Science. It wasn’t one of my courses, but I helped a friend with a project about this topic in her study program. It was hard to understand what basics I needed to solve this enigma. At the same time, the more difficult I found the task, the more compelled I felt to solve the mystery.

What is web scraping? Look at the words. Web refers to websites, while scraping is about extracting data. Put the two together and you get the real meaning: extracting data from websites. Many languages can do this task, and Python is one of the most widely used. But knowing Python alone isn’t enough to extract information from a website. You also need to know HTML.

In this article, I want to show you the basics of HTML. It’s not hard to understand, but before you can start web scraping, you first need to be comfortable with these basics. To find the right pieces of information on a page, you right-click it and select “Inspect.” You’ll see a very long block of HTML code that seems infinite. Don’t worry. You don’t need to know HTML deeply to be able to extract the data. I will alternate theory with examples so you’ll learn quickly.

Table of Contents:

  1. Intro to HTML
  2. Classes and ID
  3. Tables
  4. Lists
  5. Blocks

1. Intro to HTML

HTML stands for HyperText Markup Language. As the name suggests, it’s a language for creating web pages. It’s not a programming language like Python or Java but a markup language: it describes the elements of a page through tags enclosed in angle brackets.

If you save a short HTML document with three headings (<h1> to <h3>) in a file that ends with .html and double-click the file, the browser shows this result:

Result of HTML page with the three headers

Voilà: Now you have a very short example of HTML’s structure! The principal elements shown are:

  • The root element of the document opens with <html> and closes with </html>.
  • <body></body> contains the visible part of the HTML document.
  • The <h1> to <h3> tags define the headings, from the most to the least important.
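Since the end goal is scraping with Python, it helps to see how a parser reads these tags. Here is a minimal sketch that uses Python’s built-in html.parser module (not Beautiful Soup, which comes later in the series). The page content and the HeadingCollector class are my own placeholders for illustration:

```python
from html.parser import HTMLParser

# Hypothetical page: the heading text is an assumption made for this sketch.
PAGE = """
<html>
<body>
<h1>Albert Einstein</h1>
<h2>Biography</h2>
<h3>Early life</h3>
</body>
</html>
"""


class HeadingCollector(HTMLParser):
    """Collect (tag, text) pairs for every <h1>, <h2>, and <h3> element."""

    def __init__(self):
        super().__init__()
        self.current = None   # heading tag we are currently inside, if any
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.current = tag

    def handle_data(self, data):
        # Ignore whitespace between tags; keep text found inside a heading.
        if self.current and data.strip():
            self.headings.append((self.current, data.strip()))

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None


collector = HeadingCollector()
collector.feed(PAGE)
print(collector.headings)
```

The parser walks through the document tag by tag, which is the same way a scraper finds the pieces of a page it cares about.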

We can also add a brief paragraph under the second header using the <p> tag:

Paragraph in second header

To be more informative, we can turn the name “Albert Einstein” into a link that sends visitors directly to the Wikipedia page. The <a> tag defines a link, and its href attribute specifies the destination URL:
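When scraping, the href attribute is usually the part of a link you want. As a rough sketch with Python’s built-in html.parser, assuming a paragraph like the one described (the sentence and the LinkCollector class are placeholders of mine):

```python
from html.parser import HTMLParser

# Hypothetical snippet: the sentence is an assumption made for this sketch.
LINK_SNIPPET = (
    '<p><a href="https://en.wikipedia.org/wiki/Albert_Einstein">'
    "Albert Einstein</a> was a theoretical physicist.</p>"
)


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("href", "https://...")]
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


link_parser = LinkCollector()
link_parser.feed(LINK_SNIPPET)
print(link_parser.links)
```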

Result of HTML page after adding the link

What could we do next? We can insert an image of Albert Einstein. The tag for this task is <img>, which has the src attribute to specify the URL of the image. To get that URL, right-click the image you want to add to your mini-website and copy the image address.
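The same idea applies to images: a scraper reads the src attribute to find the image file. A small sketch with Python’s built-in html.parser follows. The example.com URL, the alt attribute, and the ImageCollector class are assumptions for illustration, not the image used on the page:

```python
from html.parser import HTMLParser

# Hypothetical tag: the URL and the alt text are placeholders.
IMG_SNIPPET = '<img src="https://example.com/einstein.jpg" alt="Albert Einstein">'


class ImageCollector(HTMLParser):
    """Collect the attributes of every <img> tag as a dictionary."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # <img> has no closing tag, so everything we need is in the start tag.
        if tag == "img":
            self.images.append(dict(attrs))


img_parser = ImageCollector()
img_parser.feed(IMG_SNIPPET)
print(img_parser.images[0]["src"])
```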

Result of HTML page after adding the image

2. Classes and ID

You are probably asking yourself why you need classes and IDs. The names alone seem so boring, but once you understand why you need them, you won’t be able to do without them.

  • The id attribute specifies a unique identifier for a single element. For example, you may want a particular colour and size only for the title in the first header.
  • The class attribute gives the same class name to several elements. Why do you need the same class in more than one element? Because you will probably want to write some phrases with the same font, colour, and size.

The styles for both IDs and classes are defined inside the <style> tag, with the properties listed in curly braces ({}). A class selector is a period (.) followed by the name of the class, while an ID selector is a hash sign (#) followed by the name of the ID. Once you have created the class and the ID rules in the <style> tag, you assign them to the elements you want through the class and id attributes. In this case, the ID is used in the <h1> tag, while the class is used in the <p> tags. I also included the <i> and <b> tags, which display italic and bold text, respectively.
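Classes and IDs matter for scraping because they let you pick out exactly the elements you want. Below is a minimal sketch with Python’s built-in html.parser. The names “title” and “text”, the CSS values, the sentences, and the AttrCollector class are all assumptions of mine, and the sketch only handles a single class name per element:

```python
from html.parser import HTMLParser

# Hypothetical page: names, CSS values, and sentences are placeholders.
STYLED_PAGE = """
<html>
<head>
<style>
#title {color: blue; font-size: 40px;}
.text {color: gray; font-size: 18px;}
</style>
</head>
<body>
<h1 id="title">Albert Einstein</h1>
<p class="text">He was a <i>theoretical</i> physicist.</p>
<p class="text">He developed the theory of <b>relativity</b>.</p>
</body>
</html>
"""


class AttrCollector(HTMLParser):
    """Store the text of elements that carry an id or a class attribute."""

    def __init__(self):
        super().__init__()
        self.by_id = {}       # id name -> text
        self.by_class = {}    # class name -> list of texts
        self._target = None   # (tag, id, class) of the element being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._target is None and ("id" in attrs or "class" in attrs):
            self._target = (tag, attrs.get("id"), attrs.get("class"))
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags such as <i> and <b> is kept as well.
        if self._target is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._target is not None and tag == self._target[0]:
            _, elem_id, elem_class = self._target
            text = "".join(self._buffer).strip()
            if elem_id:
                self.by_id[elem_id] = text
            if elem_class:
                self.by_class.setdefault(elem_class, []).append(text)
            self._target = None


attr_parser = AttrCollector()
attr_parser.feed(STYLED_PAGE)
print(attr_parser.by_id)
print(attr_parser.by_class)
```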

Result of HTML page with classes and IDs

3. Tables

Another important feature of HTML is the table, which is defined by the <table> tag. Within the <table> tag, there are three principal tags to remember:

  • The <tr> tag builds each row of the table.
  • The <th> tag defines a header cell.
  • The <td> tag defines a data cell within a row.

Let’s see an example to better understand how to build a table:
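Tables are one of the most common scraping targets, because each <tr> maps naturally to a row of data. Here is a sketch with Python’s built-in html.parser. The table content, the CSS rule, and the TableCollector class are my own placeholders, chosen only to illustrate the three tags:

```python
from html.parser import HTMLParser

# Hypothetical table: the rows and the CSS rule are placeholders.
TABLE_PAGE = """
<style>
table, th, td {border: 1px solid black; border-collapse: collapse;}
</style>
<table>
  <tr><th>Year</th><th>Event</th></tr>
  <tr><td>1879</td><td>Born in Ulm</td></tr>
  <tr><td>1921</td><td>Nobel Prize in Physics</td></tr>
  <tr><td>1955</td><td>Died in Princeton</td></tr>
</table>
"""


class TableCollector(HTMLParser):
    """Turn <tr>/<th>/<td> tags into a list of rows (lists of cell texts)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


table_parser = TableCollector()
table_parser.feed(TABLE_PAGE)
for row in table_parser.rows:
    print(row)
```

The first row holds the <th> header cells, so a scraper can use it as column names for the rows that follow.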

Result of HTML page after adding the table

You may notice that in the <style> tag, I defined the properties of the <table>, <th>, and <td> tags. I wanted to build a table with a black border, and I set border-collapse: collapse to avoid a double border.

4. Lists

There are two types of lists that can be defined in HTML. The first one is an unordered list that starts with the <ul> tag, while the other type is an ordered list specified by the <ol> tag.

Each item in both types of list is specified by the <li> tag. Below, we can see an example:
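As a sketch of how both list types look to a scraper, the snippet below uses Python’s built-in html.parser to group the <li> items by their parent list. The list entries and the ListCollector class are placeholders I picked for illustration:

```python
from html.parser import HTMLParser

# Hypothetical lists: the entries are placeholders.
LIST_PAGE = """
<h3>Discoveries</h3>
<ul>
  <li>Special relativity</li>
  <li>General relativity</li>
  <li>Photoelectric effect</li>
</ul>
<h3>Awards</h3>
<ol>
  <li>Nobel Prize in Physics (1921)</li>
  <li>Copley Medal (1925)</li>
  <li>Max Planck Medal (1929)</li>
</ol>
"""


class ListCollector(HTMLParser):
    """Group the text of <li> items under their parent list type."""

    def __init__(self):
        super().__init__()
        self.items = {"ul": [], "ol": []}
        self._list = None      # "ul" or "ol" while inside a list
        self._in_item = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._list = tag
        elif tag == "li" and self._list:
            self._in_item = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_item:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item:
            self.items[self._list].append("".join(self._buffer).strip())
            self._in_item = False
        elif tag in ("ul", "ol"):
            self._list = None


list_parser = ListCollector()
list_parser.feed(LIST_PAGE)
print(list_parser.items["ul"])
print(list_parser.items["ol"])
```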

Results after adding lists

Now you can see a list of Einstein’s discoveries and awards.

5. Blocks

Now I’ll show you the most common elements you can find on a website. These elements are usually called blocks or containers. They are useful for grouping different elements together and applying the same properties to them. So the elements we have covered so far, <h1> to <h3>, <p>, <ul>, and <ol>, can be grouped into one block.

For example, suppose we want to divide the page into two parts. To create these two blocks, I use the <div> tag. In the example, we define a class called “row” to place the two parts in the same row and a class called “column” to specify the properties of each half of the page.

Moreover, I used the * selector to select all the elements and set the box-sizing property to border-box, so that each element’s total width and height include the padding and the border.

Result after dividing the page into two parts

To specify the background colour of each block, I used the style attribute in the <div> tag. So, now you can see this colourful web page!
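On real websites, the data you want usually sits inside nested <div> blocks, so it is worth seeing how to read them. The following is a sketch with Python’s built-in html.parser. The CSS rules (including the flex layout), the colours, the text, and the ColumnCollector class are assumptions of mine that only mirror the “row” and “column” structure described above:

```python
from html.parser import HTMLParser

# Hypothetical page: CSS values, colours, and text are placeholders.
BLOCK_PAGE = """
<style>
* {box-sizing: border-box;}
.row {display: flex;}
.column {width: 50%; padding: 10px;}
</style>
<div class="row">
  <div class="column" style="background-color: lightblue;">
    <h2>Discoveries</h2>
    <p>Special relativity</p>
  </div>
  <div class="column" style="background-color: lightgreen;">
    <h2>Awards</h2>
    <p>Nobel Prize in Physics</p>
  </div>
</div>
"""


class ColumnCollector(HTMLParser):
    """Collect (inline style, text) for every <div class="column"> block."""

    def __init__(self):
        super().__init__()
        self.columns = []
        self._div_stack = []   # class of each <div> that is currently open
        self._style = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            attrs = dict(attrs)
            self._div_stack.append(attrs.get("class"))
            if attrs.get("class") == "column":
                self._style = attrs.get("style")
                self._buffer = []

    def handle_data(self, data):
        if "column" in self._div_stack:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._div_stack:
            if self._div_stack.pop() == "column":
                # Collapse line breaks and indentation into single spaces.
                text = " ".join("".join(self._buffer).split())
                self.columns.append((self._style, text))


block_parser = ColumnCollector()
block_parser.feed(BLOCK_PAGE)
for style, text in block_parser.columns:
    print(style, "->", text)
```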

Final Thoughts

I hope that this tutorial helped you grasp the principal tags of HTML. There are many other tags, but this quick overview should provide a good starting point.

As I said, web scraping is a task that needs to be divided into subtasks. Once you know this markup language, you will find it much easier to use Python libraries such as Beautiful Soup, Scrapy, and Selenium.

Thanks for reading. Have a nice day.
