Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -159,12 +159,26 @@ Great! Only if we didn't overlook an important pitfall called [floating-point er
0.30000000000000004
```

These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid floating point numbers when working with money. Let's instead use Python's built-in [`Decimal()`](https://docs.python.org/3/library/decimal.html) type:
These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid floating point numbers when working with money. We won't store dollars, but cents:

```py
price_text = (
product
.select_one(".price")
.contents[-1]
.strip()
.replace("$", "")
# highlight-next-line
.replace(".", "")
.replace(",", "")
)
```

In this case, removing the dot from the price text is the same as if we multiplied all the numbers with 100, effectively converting dollars to cents. For converting the text to a number we'll use `int()` instead of `float()`. This is how the whole program looks like now:

```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
Expand All @@ -182,13 +196,14 @@ for product in soup.select(".product-item"):
.contents[-1]
.strip()
.replace("$", "")
.replace(".", "")
.replace(",", "")
)
if price_text.startswith("From "):
min_price = Decimal(price_text.removeprefix("From "))
min_price = int(price_text.removeprefix("From "))
price = None
else:
min_price = Decimal(price_text)
min_price = int(price_text)
price = min_price

print(title, min_price, price, sep=" | ")
Expand All @@ -198,8 +213,8 @@ If we run the code above, we have nice, clean data about all the products!

```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None
JBL Flip 4 Waterproof Portable Bluetooth Speaker | 7495 | 7495
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 139800 | None
...
```

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,6 @@ Producing results line by line is an efficient approach to handling large datase
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
Expand All @@ -49,13 +48,14 @@ for product in soup.select(".product-item"):
.contents[-1]
.strip()
.replace("$", "")
.replace(".", "")
.replace(",", "")
)
if price_text.startswith("From "):
min_price = Decimal(price_text.removeprefix("From "))
min_price = int(price_text.removeprefix("From "))
price = None
else:
min_price = Decimal(price_text)
min_price = int(price_text)
price = min_price

# highlight-next-line
Expand All @@ -69,7 +69,7 @@ Before looping over the products, we prepare an empty list. Then, instead of pri

```text
$ python main.py
[{'title': 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', 'min_price': Decimal('74.95'), 'price': Decimal('74.95')}, {'title': 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', 'min_price': Decimal('1398.00'), 'price': None}, ...]
[{'title': 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', 'min_price': 7495, 'price': 7495}, {'title': 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', 'min_price': 139800, 'price': None}, ...]
```

:::tip Pretty print
Expand All @@ -87,7 +87,6 @@ In Python, we can read and write JSON using the [`json`](https://docs.python.org
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
# highlight-next-line
import json
```
Expand All @@ -99,39 +98,17 @@ with open("products.json", "w") as file:
json.dump(data, file)
```

That's it! If we run the program now, it should also create a `products.json` file in the current working directory:

```text
$ python main.py
Traceback (most recent call last):
...
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Decimal is not JSON serializable
```

Ouch! JSON supports integers and floating-point numbers, but there's no guidance on how to handle `Decimal`. To maintain precision, it's common to store monetary values as strings in JSON files. But this is a convention, not a standard, so we need to handle it manually. We'll pass a custom function to `json.dump()` to serialize objects that it can't handle directly:

```py
def serialize(obj):
if isinstance(obj, Decimal):
return str(obj)
raise TypeError("Object not JSON serializable")

with open("products.json", "w") as file:
json.dump(data, file, default=serialize)
```

If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:
That's it! If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:

<!-- eslint-skip -->
```json title=products.json
[{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null}, ...]
[{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "7495", "price": "7495"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "139800", "price": null}, ...]
```

If you skim through the data, you'll notice that the `json.dump()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash:

```json
{"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "158.00", "price": "158.00"}
{"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "15800", "price": "15800"}
```

:::tip Pretty JSON
Expand Down Expand Up @@ -177,7 +154,6 @@ Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import json
# highlight-next-line
import csv
Expand All @@ -186,13 +162,8 @@ import csv
Next, let's add one more data export to end of the source code of our scraper:

```py
def serialize(obj):
if isinstance(obj, Decimal):
return str(obj)
raise TypeError("Object not JSON serializable")

with open("products.json", "w") as file:
json.dump(data, file, default=serialize)
json.dump(data, file)

with open("products.csv", "w") as file:
writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
Expand Down Expand Up @@ -223,13 +194,12 @@ Write a new Python program that reads `products.json`, finds all products with a
```py
import json
from pprint import pp
from decimal import Decimal

with open("products.json", "r") as file:
products = json.load(file)

for product in products:
if Decimal(product["min_price"]) > 500:
if int(product["min_price"]) > 500:
pp(product)
```

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,6 @@ Over the course of the previous lessons, the code of our program grew to almost
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import json
import csv

Expand All @@ -54,24 +53,20 @@ for product in soup.select(".product-item"):
.contents[-1]
.strip()
.replace("$", "")
.replace(".", "")
.replace(",", "")
)
if price_text.startswith("From "):
min_price = Decimal(price_text.removeprefix("From "))
min_price = int(price_text.removeprefix("From "))
price = None
else:
min_price = Decimal(price_text)
min_price = int(price_text)
price = min_price

data.append({"title": title, "min_price": min_price, "price": price})

def serialize(obj):
if isinstance(obj, Decimal):
return str(obj)
raise TypeError("Object not JSON serializable")

with open("products.json", "w") as file:
json.dump(data, file, default=serialize)
json.dump(data, file)

with open("products.csv", "w") as file:
writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
Expand Down Expand Up @@ -103,13 +98,14 @@ def parse_product(product):
.contents[-1]
.strip()
.replace("$", "")
.replace(".", "")
.replace(",", "")
)
if price_text.startswith("From "):
min_price = Decimal(price_text.removeprefix("From "))
min_price = int(price_text.removeprefix("From "))
price = None
else:
min_price = Decimal(price_text)
min_price = int(price_text)
price = min_price

return {"title": title, "min_price": min_price, "price": price}
Expand All @@ -119,13 +115,8 @@ Now the JSON export. For better readability of it, let's make a small change her

```py
def export_json(file, data):
def serialize(obj):
if isinstance(obj, Decimal):
return str(obj)
raise TypeError("Object not JSON serializable")

# highlight-next-line
json.dump(data, file, default=serialize, indent=2)
json.dump(data, file, indent=2)
```

The last function we'll add will take care of the CSV export. We'll make a small change here as well. Having to specify the field names is not ideal. What if we add more field names in the parsing function? We'd always have to remember to go and edit the export function as well. If we could figure out the field names in place, we'd remove this dependency. One way would be to infer the field names from the dictionary keys of the first row:
Expand All @@ -151,7 +142,6 @@ Now let's put it all together:
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import json
import csv

Expand All @@ -171,24 +161,20 @@ def parse_product(product):
.contents[-1]
.strip()
.replace("$", "")
.replace(".", "")
.replace(",", "")
)
if price_text.startswith("From "):
min_price = Decimal(price_text.removeprefix("From "))
min_price = int(price_text.removeprefix("From "))
price = None
else:
min_price = Decimal(price_text)
min_price = int(price_text)
price = min_price

return {"title": title, "min_price": min_price, "price": price}

def export_json(file, data):
def serialize(obj):
if isinstance(obj, Decimal):
return str(obj)
raise TypeError("Object not JSON serializable")

json.dump(data, file, default=serialize, indent=2)
json.dump(data, file, indent=2)

def export_csv(file, data):
fieldnames = list(data[0].keys())
Expand Down Expand Up @@ -254,13 +240,13 @@ In the previous code example, we've also added the URL to the dictionary returne
[
{
"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
"min_price": "74.95",
"price": "74.95",
"min_price": "7495",
"price": "7495",
"url": "/products/jbl-flip-4-waterproof-portable-bluetooth-speaker"
},
{
"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
"min_price": "1398.00",
"min_price": "139800",
"price": null,
"url": "/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv"
},
Expand All @@ -277,7 +263,6 @@ Browsers reading the HTML know the base address and automatically resolve such l
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import json
import csv
# highlight-next-line
Expand Down Expand Up @@ -319,13 +304,13 @@ When we run the scraper now, we should see full URLs in our exports:
[
{
"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
"min_price": "74.95",
"price": "74.95",
"min_price": "7495",
"price": "7495",
"url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker"
},
{
"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
"min_price": "1398.00",
"min_price": "139800",
"price": null,
"url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv"
},
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,6 @@ Thanks to the refactoring, we have functions ready for each of the tasks, so we
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import json
import csv
from urllib.parse import urljoin
Expand All @@ -41,24 +40,20 @@ def parse_product(product, base_url):
.contents[-1]
.strip()
.replace("$", "")
.replace(".", "")
.replace(",", "")
)
if price_text.startswith("From "):
min_price = Decimal(price_text.removeprefix("From "))
min_price = int(price_text.removeprefix("From "))
price = None
else:
min_price = Decimal(price_text)
min_price = int(price_text)
price = min_price

return {"title": title, "min_price": min_price, "price": price, "url": url}

def export_json(file, data):
def serialize(obj):
if isinstance(obj, Decimal):
return str(obj)
raise TypeError("Object not JSON serializable")

json.dump(data, file, default=serialize, indent=2)
json.dump(data, file, indent=2)

def export_csv(file, data):
fieldnames = list(data[0].keys())
Expand Down Expand Up @@ -159,14 +154,14 @@ If we run the program now, it'll take longer to finish since it's making 24 more
[
{
"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
"min_price": "74.95",
"price": "74.95",
"min_price": "7495",
"price": "7495",
"url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker",
"vendor": "JBL"
},
{
"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
"min_price": "1398.00",
"min_price": "139800",
"price": null,
"url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv",
"vendor": "Sony"
Expand Down
Loading
Loading