Python-driven web crawler and scraper. Uses BeautifulSoup to gather all URLs from a target page and initiates a crawl from a start URL, honoring the whitelist/blacklist criteria defined in crawl.py.
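The URL-gathering step can be pictured roughly as follows. This is a minimal sketch, assuming `requests` is used for fetching; the function name `gather_urls` and the `WHITELIST`/`BLACKLIST` structures are illustrative, not the actual identifiers in crawl.py.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

WHITELIST = {"example.com"}      # domains the crawl may enter (assumed format)
BLACKLIST = {"ads.example.com"}  # domains the crawl must skip (assumed format)

def gather_urls(page_url):
    """Return the absolute URLs on a page that pass the domain filters."""
    response = requests.get(page_url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    urls = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(page_url, anchor["href"])  # resolve relative links
        domain = urlparse(absolute).netloc
        if domain in BLACKLIST:
            continue
        if WHITELIST and domain not in WHITELIST:
            continue
        urls.append(absolute)
    return urls
```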
# Crawl.py

Crawl.py is a threaded web crawler that crawls a specified domain (or collection of domains) and collects metadata about each page it visits. The crawler can be used to analyze the performance of a web server, analyze relationships between pages, or traverse a site and scrape pages.

Data collected includes:

- URL and parsed URL (scheme/netloc/path, etc.)
- page load time in milliseconds
- page size in bytes
- link addresses on the page
- number of links on the page
- number of links within the page domain
- number of links targeting external domains

Links on each page are recorded in the `url_canonical` table. Visits to each link are recorded in the `visit_metadata` table. Relationships between a visited link and all the links in the response are recorded in the `page_rel` table. Sketches of the collection step and a possible schema for these tables appear below.

Planned functionality:

- addition of an optional 'scrape page' function
- a global toggle to allow already-visited URLs to be requeued
- usage guidelines and documentation

Known issues:

- URL encoding is sometimes incorrect, which can create duplicate entries and causes errors when opening requests (see the normalization sketch below for one possible mitigation)
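The per-page metadata listed above can be collected along these lines. This is a minimal sketch, assuming `requests` and BeautifulSoup; the `visit` function and its field names are illustrative and may differ from the crawler's actual structures.

```python
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def visit(page_url):
    """Fetch one page and gather the metadata described above."""
    start = time.monotonic()
    response = requests.get(page_url, timeout=10)
    elapsed_ms = (time.monotonic() - start) * 1000  # page load time in ms

    soup = BeautifulSoup(response.text, "html.parser")
    links = [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]
    domain = urlparse(page_url).netloc
    internal = sum(1 for link in links if urlparse(link).netloc == domain)

    return {
        "url": page_url,
        "parsed_url": urlparse(page_url),     # scheme/netloc/path/etc.
        "load_time_ms": elapsed_ms,
        "size_bytes": len(response.content),  # page size in bytes
        "links": links,
        "link_count": len(links),
        "internal_links": internal,
        "external_links": len(links) - internal,
    }
```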
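The three tables could be rendered in SQLite roughly as follows. Only the table names come from this README; every column in this sketch is an assumption.

```python
import sqlite3

conn = sqlite3.connect("crawl.db")  # hypothetical database file name
conn.executescript("""
CREATE TABLE IF NOT EXISTS url_canonical (
    url_id INTEGER PRIMARY KEY,
    url    TEXT UNIQUE NOT NULL            -- every link seen on any page
);
CREATE TABLE IF NOT EXISTS visit_metadata (
    visit_id     INTEGER PRIMARY KEY,
    url_id       INTEGER REFERENCES url_canonical(url_id),
    load_time_ms REAL,                     -- page load time in milliseconds
    size_bytes   INTEGER                   -- page size in bytes
);
CREATE TABLE IF NOT EXISTS page_rel (
    visit_id  INTEGER REFERENCES visit_metadata(visit_id),
    target_id INTEGER REFERENCES url_canonical(url_id)  -- link found in the response
);
""")
conn.commit()
```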
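The URL-encoding issue could be mitigated by normalizing every URL before it is queued. This is a sketch of one possible approach, not the project's actual fix; the `normalize` function is hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def normalize(url):
    """Re-encode each URL component consistently before queueing."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        quote(parts.path, safe="/%"),    # '%' kept safe to avoid double-encoding
        quote(parts.query, safe="=&%"),
        quote(parts.fragment, safe="%"),
    ))
```

Applying such a normalization before inserting rows into `url_canonical` would also reduce the duplication the README mentions, since differently encoded forms of the same address would collapse to one canonical string.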