sources/academy/webscraping/scraping_basics_javascript2/01_devtools_inspecting.md (+9 -9)
@@ -28,11 +28,11 @@ Google Chrome is currently the most popular browser, and many others use the sam
Now let's peek behind the scenes of a real-world website—say, Wikipedia. We'll open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Then, let's press **F12**, or right-click anywhere on the page and select **Inspect**.
Websites are built with three main technologies: HTML, CSS, and JavaScript. In the **Elements** tab, DevTools shows the HTML and CSS of the current page:
:::warning Screen adaptations
@@ -62,17 +62,17 @@ While HTML and CSS describe what the browser should display, JavaScript adds int
If you don't see it, press <kbd>ESC</kbd> to toggle the Console. Running commands in the Console lets us manipulate the loaded page—we’ll try this shortly.
## Selecting an element
In the top-left corner of DevTools, let's find the icon with an arrow pointing to a square.
We'll click the icon and hover our cursor over Wikipedia's subtitle, **The Free Encyclopedia**. As we move the cursor, DevTools displays information about the HTML element under it. We'll click on the subtitle, and in the **Elements** tab, DevTools will highlight the HTML element that represents it.
The highlighted section should look something like this:
@@ -108,7 +108,7 @@ We won't be creating Node.js scrapers just yet. Let's first get familiar with wh
In the **Elements** tab, with the subtitle element highlighted, let's right-click the element to open the context menu. There, we'll choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready.
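For example, we can read the element's properties right away (the output in the comments is illustrative):

```js
temp1;             // the subtitle element itself
temp1.textContent; // its text, e.g. "The Free Encyclopedia"
temp1.outerHTML;   // its full HTML markup
```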
The Console allows us to run code in the context of the loaded page. We can use it to play around with elements.
When we change elements in the Console, those changes reflect immediately on the page!
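For example, a quick way to see this in action (the replacement text here is made up):

```js
// Changing the element's text immediately updates the rendered page
temp1.textContent = 'Hello, DevTools!';
```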
But don't worry—we haven't hacked Wikipedia. The change only happens in our browser. If we reload the page, the change will disappear. This, however, is an easy way to craft a screenshot with fake content. That's why screenshots shouldn't be trusted as evidence.
@@ -161,7 +161,7 @@ You're looking for an [`img`](https://developer.mozilla.org/en-US/docs/Web/HTML/
1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
1. In the **Console**, type `temp1.src` and hit **Enter**.
sources/academy/webscraping/scraping_basics_javascript2/02_devtools_locating_elements.md (+10 -10)
@@ -30,17 +30,17 @@ That said, we designed all the additional exercises to work with live websites.
As mentioned in the previous lesson, before building a scraper, we need to understand the structure of the target page and identify the specific elements our program should extract. Let's figure out how to select details for each product on the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales).
The page displays a grid of product cards, each showing a product's title and picture. Let's open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. We'll highlight it in the **Elements** tab by clicking on it.
Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more.
In the **Elements** tab, we'll move our cursor up from the `a` element containing the subwoofer's title. On the way, we'll hover over each element until we highlight the entire product card. Alternatively, we can use the arrow-up key. The `div` element we land on is the **parent element**, and all nested elements are its **child elements**.
At this stage, we could use the **Store as global variable** option to send the element to the **Console**. While helpful for manual inspection, this isn't something a program can do.
It will return the HTML element for the first product card in the listing:
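The call itself falls outside this diff's context; presumably it's along these lines, using the `.product-item` class discussed below:

```js
document.querySelector('.product-item');
```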
CSS selectors can get quite complex, but the basics are enough to scrape most of the Warehouse store. Let's cover two simple types and how they can combine.
@@ -114,13 +114,13 @@ The product card has four classes: `product-item`, `product-item--vertical`, `1/
This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.
## Locating all product cards
In the **Console**, hovering our cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list.
But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Let's type this into the **Console**:
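A sketch of that call, assuming the same `.product-item` selector as before:

```js
document.querySelectorAll('.product-item');
```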
@@ -132,7 +132,7 @@ The returned value is a [`NodeList`](https://developer.mozilla.org/en-US/docs/We
We'll expand the result by clicking the small arrow, then hover our cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer!
To save the subwoofer in a variable for further inspection, we can use index access with brackets, just like with regular JavaScript arrays:
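A sketch of that index access, assuming the same query as above (index 2 being the subwoofer):

```js
const subwoofer = document.querySelectorAll('.product-item')[2];
```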
@@ -151,7 +151,7 @@ Even though we're just playing in the browser's **Console**, we're inching close
On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use CSS selectors in the **Console** to list the HTML elements representing headings of the colored boxes (including the grey ones).
<details>
<summary>Solution</summary>
@@ -169,7 +169,7 @@ On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use
Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) category. In the **Console**, use CSS selectors to list all HTML elements representing the products.
<details>
<summary>Solution</summary>
@@ -194,7 +194,7 @@ Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs
:::
sources/academy/webscraping/scraping_basics_javascript2/03_devtools_extracting_data.md (+7 -7)
@@ -31,15 +31,15 @@ subwoofer.textContent;
That indeed outputs all the text, but in a form that would be hard to break down into relevant pieces.
We'll first need to locate the relevant child elements and extract the data from each of them individually.
## Extracting title
We'll use the **Elements** tab of DevTools to inspect all child elements of the product card for the Sony subwoofer. We can see that the title of the product is inside an `a` element with several classes. Of those, `product-item__title` seems like a great choice for locating the element.
Browser JavaScript represents HTML elements as [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element) objects. In addition to the properties we've already played with, such as `textContent` and `outerHTML`, each element also has a [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelector) method. Here, the method looks for matches only within the element's children:
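A sketch of that lookup; the `title` variable name matches the `title.textContent;` context visible in the next hunk:

```js
const title = subwoofer.querySelector('.product-item__title');
title.textContent;
```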
@@ -50,13 +50,13 @@ title.textContent;
Notice we're calling `querySelector()` on the `subwoofer` variable, not `document`. And just like this, we've scraped our first piece of data! We've extracted the product title:
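The output would be something like this (possibly with surrounding white space):

```text
Sony SACS9 Active Subwoofer
```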
To figure out how to get the price, we'll use the **Elements** tab of DevTools again. We notice there are two prices, a regular price and a sale price. For the purpose of watching prices, we'll need the sale price. Both are `span` elements with the `price` class.
We could rely either on the fact that the sale price is likely to always be the one that's highlighted, or on the fact that it's always the first price. For now, we'll go with the latter and let `querySelector()` simply return the first result:
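A sketch of that call, mirroring the `price.textContent;` context in the next hunk:

```js
const price = subwoofer.querySelector('.price');
price.textContent;
```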
@@ -67,7 +67,7 @@ price.textContent;
It works, but the price isn't alone in the result. Before we could use such data, we'd need to do some **data cleaning**—for example, trimming white space and converting the price to a number.
But for now that's okay. We're just testing the waters so that we have an idea of what our scraper will need to do. Once we get to extracting prices in Node.js, we'll figure out how to get the values as numbers.
On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selectors and HTML element manipulation in the **Console** to extract the name of the top wiki. Use the [`trim()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim) method to remove white space around the name.
<details>
<summary>Solution</summary>
@@ -119,7 +119,7 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto
On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and the URL of the associated photo.
sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md (+2 -2)
@@ -14,7 +14,7 @@ import Exercises from '../scraping_basics/_exercises.mdx';
From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`.
As a first step, let's try counting how many products are on the listing page.
@@ -50,7 +50,7 @@ Being comfortable around installing Node.js packages is a prerequisite of this c
Now let's import the package and use it for parsing the HTML. The `cheerio` module allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
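A minimal sketch of such a program, assuming the Sales page URL used throughout the course:

```js
import * as cheerio from 'cheerio';

const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const response = await fetch(url);

if (response.ok) {
  const html = await response.text();
  // Parse the downloaded HTML and select the main heading
  const $ = cheerio.load(html);
  console.log($('h1'));
} else {
  throw new Error(`HTTP ${response.status}`);
}
```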
The program should now also produce a `data.csv` file. When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have.
In the CSV format, if a value contains commas, we should enclose it in quotes. If it contains quotes, we should double them. When we open the file in a text editor of our choice, we can see that the library automatically handled this:
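As an illustration of both rules (the rows below are made up, not taken from the real file):

```csv
title,minPrice,price
"Sony SACS9 10"" Active Subwoofer",158.00,158.00
"Tuner, amplifier and speaker bundle",119.99,119.99
```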
@@ -232,6 +232,6 @@ Open the `products.csv` file we created in the lesson using a spreadsheet applic
1. Select the header row. Go to **Data > Create filter**.
1. Use the filter icon that appears next to `minPrice`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md (+2 -2)
@@ -205,15 +205,15 @@ The program is much easier to read now. With the `parseProduct()` function handy
We turned the whole program upside down, and at the same time, we didn't make any actual changes! This is [refactoring](https://en.wikipedia.org/wiki/Code_refactoring): improving the structure of existing code without changing its behavior.
With everything in place, we can now start working on a scraper that also scrapes the product pages. For that, we'll need the links to those pages. Let's open the browser DevTools and remind ourselves of the structure of a single product item:
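The structure itself isn't shown here, but given the `product-item__title` link we located in earlier lessons, collecting the URLs with Cheerio would look something like this sketch:

```js
// Assumes `$` comes from cheerio.load(html), as in previous lessons.
// The hrefs are relative, so they'd still need to be resolved
// against the site's base URL.
const urls = $('.product-item')
  .toArray()
  .map((card) => $(card).find('a.product-item__title').attr('href'));
```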
Each product URL points to a so-called _product detail page_, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more.
Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. As a demonstration, let's scrape the vendor name. In browser DevTools, we can see that the HTML around the vendor name has the following structure:
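That structure isn't captured here either. Assuming the vendor sits in an element with a class such as `product-meta__vendor` (an assumption, not confirmed by this text), scraping it would be a one-liner:

```js
// Hypothetical class name—check the actual markup in DevTools first
const vendor = $('.product-meta__vendor').text().trim();
```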
@@ -197,7 +197,7 @@ Scraping the vendor's name is nice, but the main reason we started checking the
Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs…
In the next lesson, we'll scrape the product detail pages so that each product variant is represented as a separate item in our dataset.