feat: update the parsing lesson of the JS2 course to be about JavaScript #1760
Preview for this PR was built for commit |
79e99dc
to
0f57af5
Compare
:::info Why regex can't parse HTML

While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't explain much. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty.
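To make the point about nesting concrete, here is a minimal sketch in plain JavaScript. The markup and both patterns are made up for illustration and are not part of the lesson. It shows that neither a lazy nor a greedy pattern can pair an opening tag with its own closing tag:

```javascript
// Hypothetical markup for illustration only.
const nested = '<div>outer <div>inner</div> tail</div>';

// Lazy quantifier: the match stops at the first closing tag,
// which belongs to the inner <div>, so the outer element is cut short.
const lazy = nested.match(/<div>(.*?)<\/div>/)[1];

// Greedy quantifier: the match runs to the last closing tag,
// so two sibling elements get glued together into one result.
const siblings = '<div>first</div><div>second</div>';
const greedy = siblings.match(/<div>(.*)<\/div>/)[1];

console.log(lazy);   // outer <div>inner
console.log(greedy); // first</div><div>second
```

Matching each opening tag with the right closing tag requires tracking the nesting depth, which is what a real HTML parser does.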
Can we break up this note visually?
I changed it in the Python course as well. The change is in a3bc420.
We'll choose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/) as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies.
We'll choose [Cheerio](https://cheerio.js.org/) as our parser, as it's a popular library which can process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies. In the project directory, we'll run the following to install the Cheerio package:
```text |
Wouldn't ```bash make more sense? (I guess that's just a nit.)
I use ```text consistently throughout both of the courses, because setting it as bash more often than not highlights words which I don't need colored, especially in the output of the commands. Usually, there's like one line of bash and the rest is the output, so I opted for setting it as text, because the result is visually less confusing, imho.
I think ```bash would be great for actual bash scripts, bash functions, etc.
sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md
Co-authored-by: Michał Olender <[email protected]>
5bc11e4
to
a3bc420
Compare
…ipt (apify#1760) Part of apify#1584, fixes apify#1648⚠️ 🐍 Includes respective changes also to the original `scraping_basics_python` lesson --------- Co-authored-by: Michał Olender <[email protected]>
Part of #1584, fixes #1648
⚠️ 🐍 Includes respective changes also to the original `scraping_basics_python` lesson