> **Warning**
> This README was generated by Claude Sonnet 4 and has been double-checked by 4UFFIN.
> The rest of this project was created using Google Gemini 2.5 Flash. Expect bugs and use with caution!
To view what has been crawled, browse the `crawled_output/` directory in this repository.
An automated web crawling system that discovers URLs from target websites and extracts their plain text content using GitHub Actions.
This project consists of two main components:
- URL Discovery: Automatically finds new URLs from configured target websites
- Web Crawling: Extracts plain text content from discovered URLs and saves them locally
Both processes run automatically on a schedule using GitHub Actions, with the ability to trigger them manually.

## Features

- 🔄 Automated URL Discovery: Scans target websites for new links matching specified prefixes
- 📄 Plain Text Extraction: Converts HTML content to clean, readable plain text
- 🚀 GitHub Actions Integration: Runs automatically every hour with manual trigger support
- 📁 Change Detection: Only saves content when changes are detected to minimize commits
- 🛡️ Error Handling: Continues operation even if individual URLs fail
- 🔧 Configurable Targets: Easy to add new websites to crawl

## Project Structure

```
├── .github/workflows/
│   ├── find-urls.yml      # GitHub Action for URL discovery
│   └── crawl.yml          # GitHub Action for web crawling
├── crawled_output/        # Directory containing extracted text files
├── find_urls.py           # URL discovery script
├── crawl.py               # Web crawling script
├── urls.txt               # List of URLs to crawl
├── requirements.txt       # Python dependencies
└── README.md              # This file
```

## How It Works

### URL Discovery (`find_urls.py`)

- Scans configured target websites (currently Wikipedia and kernel.org)
- Extracts links that match specified URL prefixes
- Adds new, unique URLs to `urls.txt`
- Commits changes if new URLs are found
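
For orientation, the discovery logic amounts to roughly the following. This is a hedged sketch, not the actual contents of `find_urls.py`: the function name, the User-Agent value, and the deduplication details are illustrative assumptions.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Prefix map in the shape of TARGET_SITES (see Configuration below); entries are examples.
TARGET_SITES = {
    "https://en.wikipedia.org/wiki/Main_Page": "https://en.wikipedia.org/wiki/",
}

def discover_urls(page_url, prefix):
    """Return absolute links on page_url that start with the given prefix."""
    # Assumed User-Agent value; the real script sets its own header.
    resp = requests.get(page_url, headers={"User-Agent": "url-discovery-bot"}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    found = set()
    for a in soup.find_all("a", href=True):
        href = urljoin(page_url, a["href"]).split("#")[0]  # absolutize, drop fragments
        if href.startswith(prefix):
            found.add(href)
    return found

if __name__ == "__main__":
    # Keep only URLs that are not already listed in urls.txt.
    try:
        with open("urls.txt", encoding="utf-8") as f:
            known = {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        known = set()

    new_urls = set()
    for page, prefix in TARGET_SITES.items():
        new_urls |= discover_urls(page, prefix) - known

    if new_urls:
        with open("urls.txt", "a", encoding="utf-8") as f:
            f.writelines(url + "\n" for url in sorted(new_urls))
```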

### Web Crawling (`crawl.py`)

- Reads URLs from `urls.txt`
- Fetches each URL and extracts plain text content
- Saves content to individual `.txt` files in `crawled_output/`
- Only saves files when content has changed
- Commits changes if any files were updated
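
The extract-and-compare loop can be pictured like this. It is a sketch under stated assumptions: the real `crawl.py` may name output files differently, clean text more aggressively, and wrap each fetch in per-URL error handling (see Error Handling below).

```python
import os
import re

import requests
from bs4 import BeautifulSoup

OUTPUT_DIR = "crawled_output"

def extract_plain_text(html):
    """Strip scripts/styles and collapse excess whitespace into readable text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    return re.sub(r"\n{3,}", "\n\n", text).strip()

def output_path(url):
    # Assumption: the real script may derive file names differently.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", url)
    return os.path.join(OUTPUT_DIR, safe + ".txt")

def crawl(urls):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for url in urls:
        resp = requests.get(url, headers={"User-Agent": "crawler-bot"}, timeout=30)
        resp.raise_for_status()
        text = extract_plain_text(resp.text)
        path = output_path(url)
        # Change detection: skip the write (and hence the commit) when nothing changed.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                if f.read() == text:
                    continue
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

if __name__ == "__main__":
    with open("urls.txt", encoding="utf-8") as f:
        crawl([line.strip() for line in f if line.strip()])
```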

## Configuration

### Target Websites

Edit the `TARGET_SITES` dictionary in `find_urls.py`:

```python
TARGET_SITES = {
    "https://en.wikipedia.org/wiki/Main_Page": "https://en.wikipedia.org/wiki/",
    "https://kernel.org/": "https://www.kernel.org/category/",
    "https://example.com/": "https://example.com/section/",  # Add new sites here
}
```
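
Judging from the examples above, each key is the page that gets scanned for links and each value is the URL prefix a discovered link must start with in order to be added to `urls.txt`.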

### Schedule

Both workflows run hourly by default. To change the schedule, modify the cron expression in the workflow files:

```yaml
schedule:
  - cron: '0 * * * *'  # Every hour
```
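
The five cron fields are minute, hour, day of month, month, and day of week, so `0 * * * *` fires at minute 0 of every hour; `0 */6 * * *`, for example, would run every six hours instead. Note that GitHub Actions evaluates cron schedules in UTC.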

## Setup

1. **Clone the repository**

   ```bash
   git clone <your-repo-url>
   cd <repo-name>
   ```

2. **Install dependencies locally (optional)**

   ```bash
   pip install -r requirements.txt
   ```

3. **Configure target websites (optional)**

   - Edit `TARGET_SITES` in `find_urls.py` to add new websites
   - Edit `urls.txt` directly to seed specific URLs to crawl

4. **Enable GitHub Actions**

   - Ensure Actions are enabled in your repository settings
   - The workflows will start running automatically

## Usage

### Run Locally

```bash
python find_urls.py
python crawl.py
```

### Trigger a Workflow Manually

- Go to the "Actions" tab in your GitHub repository
- Select either "Find New URLs" or "Web Crawler"
- Click "Run workflow"

## Dependencies

- `requests`: HTTP library for fetching web pages
- `beautifulsoup4`: HTML parsing library for extracting content and links

## Permissions

Both workflows require the following permission:

- `contents: write`: to commit and push changes to the repository

## Output

- URLs: Newly discovered URLs are appended to `urls.txt`
- Content: Plain text content is saved as `.txt` files in `crawled_output/`
- Commits: Automated commits are made when changes are detected

## Error Handling

- Individual URL failures don't stop the entire process
- Network timeouts are handled gracefully
- Invalid URLs are skipped with error messages
- Workflows use `continue-on-error: true` to maintain robustness
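
In practice this usually comes down to a per-URL try/except with a request timeout, roughly as below. This is a hedged illustration of the pattern, not the exact code in `crawl.py`.

```python
import requests

def fetch(url, timeout=30):
    """Fetch a URL and return its HTML, or None if this URL should be skipped."""
    try:
        resp = requests.get(url, headers={"User-Agent": "crawler-bot"}, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.exceptions.RequestException as exc:
        # Covers timeouts, connection errors, malformed URLs, and HTTP error statuses.
        print(f"Skipping {url}: {exc}")
        return None

# The calling loop simply continues when fetch() returns None,
# so a single bad URL never aborts the whole crawl.
```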

## Current Targets

- Wikipedia: Discovers article pages from the main page
- Kernel.org: Finds category pages and documentation

## Notes

- The crawler sets a User-Agent header on its requests as basic web etiquette
- Content is cleaned to remove scripts, styles, and excessive whitespace
- Changes are only committed when new content is found, keeping the repository clean
- All URLs and content are stored as plain text for easy version control

## Contributing

- Fork the repository
- Add your target websites to `TARGET_SITES`
- Test the changes locally
- Submit a pull request

## License

This project is licensed under the MIT License. See the LICENSE file for more information.