Conversation

honzajavorek
Collaborator

Part of #1584, fixes #1648

⚠️ 🐍 Also includes the corresponding changes to the original scraping_basics_python lesson

@honzajavorek honzajavorek requested review from gullmar and TC-MO August 5, 2025 12:19
@honzajavorek honzajavorek added the t-academy Issues related to Web Scraping and Apify academies. label Aug 5, 2025
@apify-service-account

Preview for this PR was built for commit 0a2a15e2 and is ready at https://pr-1760.preview.docs.apify.com!

@apify-service-account

Preview for this PR was built for commit 79e99dce and is ready at https://pr-1760.preview.docs.apify.com!

@honzajavorek honzajavorek force-pushed the honzajavorek/js-parsing branch from 79e99dc to 0f57af5 Compare August 5, 2025 12:37

cursor bot commented Aug 5, 2025

🚨 Bugbot Trial Expired

Your team's Bugbot trial has expired. Please contact your team administrator to turn on the paid plan to continue using Bugbot.

A team admin can activate the plan in the Cursor dashboard.

@apify-service-account

Preview for this PR was built for commit 0f57af5 and is ready at https://pr-1760.preview.docs.apify.com!


:::info Why regex can't parse HTML

While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go much into explaining. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
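
To make the point of this note concrete, here is a minimal sketch in plain JavaScript (no Cheerio involved). The markup strings and variable names are made up for illustration and are not part of the lesson. It shows how a regex that looks reasonable breaks as soon as tags are nested or repeated:

```javascript
// Hypothetical markup: one <div> nested inside another.
const nested = '<div class="outer"><div class="inner">Hello</div> world</div>';

// A lazy pattern stops at the FIRST closing tag it meets. A regular
// expression has no way to count how deeply the tags are nested.
const lazyMatch = nested.match(/<div[^>]*>(.*?)<\/div>/);
console.log(lazyMatch[1]); // <div class="inner">Hello

// Hypothetical markup: two sibling <div> elements.
const siblings = '<div>a</div><div>b</div>';

// A greedy pattern has the opposite problem: it runs to the LAST
// closing tag and swallows both siblings as if they were one element.
const greedyMatch = siblings.match(/<div[^>]*>(.*)<\/div>/);
console.log(greedyMatch[1]); // a</div><div>b
```

Neither variant returns the content of a single, complete element in both cases, which is why the lesson reaches for a real HTML parser instead.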
Contributor

Can we break up this note visually?

Collaborator Author

There are a lot of links, so it might look worse in the diff than rendered, but I'll try to figure something out.
Screenshot 2025-08-25 at 16 04 43

Collaborator Author

my take
Screenshot 2025-08-25 at 16 20 08

Collaborator Author

I changed it in the Python course as well. The change is in a3bc420

We'll choose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/) as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
We'll choose [Cheerio](https://cheerio.js.org/) as our parser, as it's a popular library which can process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies. In the project directory, we'll run the following to install the Cheerio package:

```text
Contributor

Wouldn't ```bash make more sense? (I guess that's just a nit.)

Collaborator Author

I use ```text consistently throughout both courses, because setting it to bash more often than not highlights words I don't need colored, especially in the output of the commands. Usually there's one line of bash and the rest is output, so I opted for text, because the result is visually less confusing, imho.

Collaborator Author

I think ```bash would be great for actual bash scripts, bash functions, etc.

@apify-service-account

Preview for this PR was built for commit c187a9b0 and is ready at https://pr-1760.preview.docs.apify.com!

@honzajavorek honzajavorek force-pushed the honzajavorek/js-parsing branch from 5bc11e4 to a3bc420 Compare August 25, 2025 14:21
@apify-service-account

Preview for this PR was built for commit a3bc420 and is ready at https://pr-1760.preview.docs.apify.com!

@honzajavorek honzajavorek requested a review from TC-MO August 25, 2025 14:25
@honzajavorek honzajavorek merged commit 30d0148 into master Aug 25, 2025
10 checks passed
@honzajavorek honzajavorek deleted the honzajavorek/js-parsing branch August 25, 2025 16:34
daveomri pushed a commit to daveomri/apify-docs that referenced this pull request Sep 3, 2025
…ipt (apify#1760)

Part of apify#1584, fixes
apify#1648

⚠️ 🐍 Also includes the corresponding changes to the original
`scraping_basics_python` lesson

---------

Co-authored-by: Michał Olender <[email protected]>
Labels
t-academy Issues related to Web Scraping and Apify academies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

F1 changed website, few Apify Academy exercises are broken
4 participants