
Warning

THIS README WAS GENERATED BY CLAUDE SONNET 4 AND HAS BEEN DOUBLE-CHECKED BY 4UFFIN.

THE REST OF THIS PROJECT WAS CREATED USING GOOGLE GEMINI 2.5 FLASH. EXPECT BUGS AND USE WITH CAUTION!


To view what's been crawled, click here.

Web Crawler Project

An automated web crawling system that discovers URLs from target websites and extracts their plain text content using GitHub Actions.

Overview

This project consists of two main components:

  • URL Discovery: Automatically finds new URLs from configured target websites
  • Web Crawling: Extracts plain text content from discovered URLs and saves them locally

Both processes run automatically on a schedule using GitHub Actions, with the ability to trigger them manually.

Features

  • 🔄 Automated URL Discovery: Scans target websites for new links matching specified prefixes
  • 📄 Plain Text Extraction: Converts HTML content to clean, readable plain text
  • 🚀 GitHub Actions Integration: Runs automatically every hour with manual trigger support
  • 📁 Change Detection: Only saves content when changes are detected to minimize commits
  • 🛡️ Error Handling: Continues operation even if individual URLs fail
  • 🔧 Configurable Targets: Easy to add new websites to crawl

Project Structure

├── .github/workflows/
│   ├── find-urls.yml      # GitHub Action for URL discovery
│   └── crawl.yml          # GitHub Action for web crawling
├── crawled_output/        # Directory containing extracted text files
├── find_urls.py          # URL discovery script
├── crawl.py              # Web crawling script
├── urls.txt              # List of URLs to crawl
├── requirements.txt      # Python dependencies
└── README.md            # This file

How It Works

URL Discovery Process

  1. Scans configured target websites (currently Wikipedia and kernel.org)
  2. Extracts links that match specified URL prefixes
  3. Adds new, unique URLs to urls.txt
  4. Commits changes if new URLs are found
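
As a rough sketch of the discovery step, using the requests and beautifulsoup4 libraries listed under Dependencies (names like discover_urls are illustrative assumptions, not the actual find_urls.py internals):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def discover_urls(start_page, prefix, known_urls):
    """Return links on start_page that begin with prefix and are not already known."""
    response = requests.get(start_page, headers={"User-Agent": "web-crawler-project"}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    found = set()
    for anchor in soup.find_all("a", href=True):
        url = urljoin(start_page, anchor["href"])  # resolve relative links against the page
        if url.startswith(prefix) and url not in known_urls:
            found.add(url)
    return found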

Web Crawling Process

  1. Reads URLs from urls.txt
  2. Fetches each URL and extracts plain text content
  3. Saves content to individual .txt files in crawled_output/
  4. Only saves files when content has changed
  5. Commits changes if any files were updated
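
A sketch of the per-URL fetch-and-save logic under the same assumptions (crawl_url and the filename scheme are illustrative; the change check simply compares the new text against the existing file before writing):

import os
import re
import requests
from bs4 import BeautifulSoup

def crawl_url(url, output_dir="crawled_output"):
    """Fetch url, extract plain text, and write it only if the content changed."""
    response = requests.get(url, headers={"User-Agent": "web-crawler-project"}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator="\n").strip()

    safe_name = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_") + ".txt"  # derive a filesystem-safe name
    path = os.path.join(output_dir, safe_name)
    os.makedirs(output_dir, exist_ok=True)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            if f.read() == text:
                return False  # unchanged content: skip the write to avoid a needless commit
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return True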

Configuration

Adding New Target Websites

Edit the TARGET_SITES dictionary in find_urls.py. Each key is the page to scan, and each value is the URL prefix a discovered link must start with to be kept:

TARGET_SITES = {
    "https://en.wikipedia.org/wiki/Main_Page": "https://en.wikipedia.org/wiki/",
    "https://kernel.org/": "https://www.kernel.org/category/",
    "https://example.com/": "https://example.com/section/"  # Add new sites here
}
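
Pairing this dictionary with the discover_urls sketch shown earlier, the discovery script might consume it roughly like so (again illustrative, not the script's exact flow):

# Illustrative driver, assuming the discover_urls sketch above and the
# TARGET_SITES dictionary shown here; find_urls.py may differ in detail.
with open("urls.txt", encoding="utf-8") as f:
    known = {line.strip() for line in f if line.strip()}

new_urls = set()
for start_page, prefix in TARGET_SITES.items():
    new_urls |= discover_urls(start_page, prefix, known)

if new_urls:
    with open("urls.txt", "a", encoding="utf-8") as f:
        for url in sorted(new_urls):
            f.write(url + "\n")  # append new, unique URLs for the crawler to pick up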

Schedule Configuration

Both workflows run hourly by default. To change the schedule, modify the cron expression in the workflow files:

schedule:
  - cron: '0 * * * *'  # Every hour

Setup Instructions

  1. Clone the repository

    git clone <your-repo-url>
    cd <repo-name>
  2. Install dependencies locally (optional)

    pip install -r requirements.txt
  3. Configure target websites (optional)

    • Edit TARGET_SITES in find_urls.py to add new websites
  4. Enable GitHub Actions

    • Ensure Actions are enabled in your repository settings
    • The workflows will start running automatically

Manual Execution

Run URL Discovery

python find_urls.py

Run Web Crawling

python crawl.py

Trigger GitHub Actions Manually

  1. Go to the "Actions" tab in your GitHub repository
  2. Select either "Find New URLs" or "Web Crawler"
  3. Click "Run workflow"

Dependencies

  • requests: HTTP library for fetching web pages
  • beautifulsoup4: HTML parsing library for extracting content and links

GitHub Actions Permissions

Both workflows require the following permissions:

  • contents: write - To commit and push changes to the repository

Output

  • URLs: Newly discovered URLs are appended to urls.txt
  • Content: Plain text content is saved as .txt files in crawled_output/
  • Commits: Automated commits are made when changes are detected

Error Handling

  • Individual URL failures don't stop the entire process
  • Network timeouts are handled gracefully
  • Invalid URLs are skipped with error messages
  • Workflows use continue-on-error: true to maintain robustness
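
In practice this amounts to a guard around each fetch, roughly like the following (illustrative names; crawl.py's actual loop may differ):

import requests

def crawl_all(urls):
    """Attempt every URL; report failures and keep going instead of aborting the run."""
    for url in urls:
        try:
            response = requests.get(url, timeout=10)  # timeout so a hung server can't stall the job
            response.raise_for_status()               # treat HTTP errors (404, 500) as failures
            # ... extract and save the plain text here ...
        except requests.exceptions.RequestException as exc:
            print(f"Error processing {url}: {exc}")   # log and move on to the next URL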

Current Target Sites

  • Wikipedia: Discovers article pages from the main page
  • Kernel.org: Finds category pages and documentation

Notes

  • The crawler respects basic web etiquette with User-Agent headers
  • Content is cleaned to remove scripts, styles, and excessive whitespace (see the sketch after this list)
  • Changes are committed only when new content is found, keeping the repository clean
  • All URLs and content are stored as plain text for easy version control
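
The cleaning step mentioned above might look something like this (a sketch, not the exact crawl.py code):

import re
from bs4 import BeautifulSoup

def clean_html(html):
    """Reduce raw HTML to readable plain text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-content elements entirely
    text = soup.get_text(separator="\n")
    return re.sub(r"\n\s*\n+", "\n\n", text).strip()  # collapse runs of blank lines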

Contributing

  1. Fork the repository
  2. Add your target websites to TARGET_SITES
  3. Test the changes locally
  4. Submit a pull request

License

This project is licensed under the MIT License. See LICENSE for more information.
