Python-driven web crawler and scraper. Uses BeautifulSoup to gather all URLs from a target page and initiates a crawl from a start URL, honoring the whitelist/blacklist criteria defined in crawl.py.
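The URL-gathering step can be pictured roughly as follows. This is a minimal sketch, assuming `requests` is used for fetching; the function name `gather_urls` and the `WHITELIST`/`BLACKLIST` structures are illustrative, not the actual identifiers in crawl.py.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

WHITELIST = {"example.com"}      # domains the crawl may enter (assumed format)
BLACKLIST = {"ads.example.com"}  # domains the crawl must skip (assumed format)

def gather_urls(page_url):
    """Return the absolute URLs on a page that pass the domain filters."""
    response = requests.get(page_url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    urls = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(page_url, anchor["href"])  # resolve relative links
        domain = urlparse(absolute).netloc
        if domain in BLACKLIST:
            continue
        if WHITELIST and domain not in WHITELIST:
            continue
        urls.append(absolute)
    return urls
```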
# Crawl.py

Crawl.py is a threaded web crawler that crawls a specified domain (or collection of domains) and collects metadata about each page it visits. The crawler can be used to analyze the performance of a web server, analyze relationships between pages, or traverse a site and scrape pages.

Data collected includes:

- URL and parsed URL (scheme/netloc/path, etc.)
- page load time in milliseconds
- page size in bytes
- link addresses on the page
- number of links on the page
- number of links within the page domain
- number of links targeting external domains

Links on each page are recorded in the `url_canonical` table. Visits to each link are recorded in the `visit_metadata` table. Relationships between a visited link and all the links in the response are recorded in the `page_rel` table. Sketches of the collection step and a possible schema for these tables appear below.

Planned functionality:

- addition of an optional 'scrape page' function
- a global toggle to allow already-visited URLs to be requeued
- usage guidelines and documentation

Known issues:

- URL encoding is sometimes incorrect, which can create duplicate entries and causes errors when opening requests (see the normalization sketch below for one possible mitigation)
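The per-page metadata listed above can be collected along these lines. This is a minimal sketch, assuming `requests` and BeautifulSoup; the `visit` function and its field names are illustrative and may differ from the crawler's actual structures.

```python
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def visit(page_url):
    """Fetch one page and gather the metadata described above."""
    start = time.monotonic()
    response = requests.get(page_url, timeout=10)
    elapsed_ms = (time.monotonic() - start) * 1000  # page load time in ms

    soup = BeautifulSoup(response.text, "html.parser")
    links = [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]
    domain = urlparse(page_url).netloc
    internal = sum(1 for link in links if urlparse(link).netloc == domain)

    return {
        "url": page_url,
        "parsed_url": urlparse(page_url),     # scheme/netloc/path/etc.
        "load_time_ms": elapsed_ms,
        "size_bytes": len(response.content),  # page size in bytes
        "links": links,
        "link_count": len(links),
        "internal_links": internal,
        "external_links": len(links) - internal,
    }
```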
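The three tables could be rendered in SQLite roughly as follows. Only the table names come from this README; every column in this sketch is an assumption.

```python
import sqlite3

conn = sqlite3.connect("crawl.db")  # hypothetical database file name
conn.executescript("""
CREATE TABLE IF NOT EXISTS url_canonical (
    url_id INTEGER PRIMARY KEY,
    url    TEXT UNIQUE NOT NULL            -- every link seen on any page
);
CREATE TABLE IF NOT EXISTS visit_metadata (
    visit_id     INTEGER PRIMARY KEY,
    url_id       INTEGER REFERENCES url_canonical(url_id),
    load_time_ms REAL,                     -- page load time in milliseconds
    size_bytes   INTEGER                   -- page size in bytes
);
CREATE TABLE IF NOT EXISTS page_rel (
    visit_id  INTEGER REFERENCES visit_metadata(visit_id),
    target_id INTEGER REFERENCES url_canonical(url_id)  -- link found in the response
);
""")
conn.commit()
```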
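The URL-encoding issue could be mitigated by normalizing every URL before it is queued. This is a sketch of one possible approach, not the project's actual fix; the `normalize` function is hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def normalize(url):
    """Re-encode each URL component consistently before queueing."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        quote(parts.path, safe="/%"),    # '%' kept safe to avoid double-encoding
        quote(parts.query, safe="=&%"),
        quote(parts.fragment, safe="%"),
    ))
```

Applying such a normalization before inserting rows into `url_canonical` would also reduce the duplication the README mentions, since differently encoded forms of the same address would collapse to one canonical string.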