AntNest is a simple, clear and fast Web Crawler framework build on python3.6+, powered by asyncio. It has only 600+ lines core code now(thanks powerful lib like aiohttp, lxml and other else).
- Useful http client out of box
- Easy pipelines, in async or not
- Easy item extractor, define data detail(by xpath, jpath or regex) and extract from html, json or strings.
- Easy async work flow, build in async task pool
pip install ant_nest
Create one demo project:
>>> ant_nest -c examples
Then we get a project:
drwxr-xr-x 5 bruce staff 160 Jun 30 18:24 ants -rw-r--r-- 1 bruce staff 208 Jun 26 22:59 settings.py
Presume we want to get hot repos from github, let`s create "examples/ants/example2.py":
from yarl import URL from ant_nest.ant import Ant from ant_nest.pipelines import ItemFieldReplacePipeline from ant_nest.items import Extractor class GithubAnt(Ant): """Crawl trending repositories from github""" item_pipelines = [ ItemFieldReplacePipeline( ("meta_content", "star", "fork"), excess_chars=("\r", "\n", "\t", " ") ) ] def __init__(self): super().__init__() self.item_extractor = Extractor(dict) self.item_extractor.add_extractor( "title", lambda x: x.html_element.xpath("//h1/strong/a/text()")[0] ) self.item_extractor.add_extractor( "author", lambda x: x.html_element.xpath("//h1/span/a/text()")[0] ) self.item_extractor.add_extractor( "meta_content", lambda x: "".join( x.html_element.xpath( '//div[@class="repository-content "]/div[2]//text()' ) ), ) self.item_extractor.add_extractor( "star", lambda x: x.html_element.xpath( '//a[@class="social-count js-social-count"]/text()' )[0], ) self.item_extractor.add_extractor( "fork", lambda x: x.html_element.xpath('//a[@class="social-count"]/text()')[0], ) self.item_extractor.add_extractor("origin_url", lambda x: str(x.url)) async def crawl_repo(self, url): """Crawl information from one repo""" response = await self.request(url) # extract item from response item = self.item_extractor.extract(response) item["origin_url"] = response.url await self.collect(item) # let item go through pipelines(be cleaned) self.logger.info("*" * 70 + "I got one hot repo!\n" + str(item)) async def run(self): """App entrance, our play ground""" response = await self.request("https://github.com/explore") for url in response.html_element.xpath( "/html/body/div[4]/main/div[2]/div/div[2]/div[1]/article/div/div[1]/h1/a[2]/" "@href" ): # crawl many repos with our coroutines pool self.schedule_task(self.crawl_repo(response.url.join(URL(url)))) self.logger.info("Waiting...")
Then we can list all ants we defined (in "examples")
>>> $ant_nest -l ants.example2.GithubAnt
Run it! (without debug log):
>>> ant_nest -a ants.example2.GithubAnt INFO:GithubAnt:Opening INFO:GithubAnt:Waiting... INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'app-ideas', 'author': 'florinpop17', 'meta_content': 'A Collection of application ideas which can be used to improve your coding skills.', 'star': '11.7k', 'fork': '500', 'origin_url': URL('https://github.com/florinpop17/app-ideas')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'Carbon', 'author': 'briannesbitt', 'meta_content': 'A simple PHP API extension for DateTime.https://carbon.nesbot.com/', 'star': '14k', 'fork': '249', 'origin_url': URL('https://github.com/briannesbitt/Carbon')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'org-roam', 'author': 'jethrokuan', 'meta_content': 'Rudimentary Roam replica with Org-modehttps://org-roam.readthedocs.io/en/la…', 'star': '261', 'fork': '27', 'origin_url': URL('https://github.com/jethrokuan/org-roam')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'joplin', 'author': 'laurent22', 'meta_content': 'Joplin - an open source note taking and to-do application with synchronization capabilities for Windows, macOS, Linux, Android and iOS. Forum: https://discourse.joplinapp.org/https://joplinapp.org', 'star': '13k', 'fork': '335', 'origin_url': URL('https://github.com/laurent22/joplin')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'snoop', 'author': 'snooppr', 'meta_content': 'Snoop — инструмент разведки на основе открытых данных', 'star': '281', 'fork': '9', 'origin_url': URL('https://github.com/snooppr/snoop')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': '1on1-questions', 'author': 'VGraupera', 'meta_content': 'Mega list of 1 on 1 meeting questions compiled from a variety to sources', 'star': '4k', 'fork': '93', 'origin_url': URL('https://github.com/VGraupera/1on1-questions')} INFO:GithubAnt:Get 8 Request in total with 8/60s rate INFO:GithubAnt:Get 7 Response in total with 7/60s rate INFO:GithubAnt:Get 6 dict in total with 6/60s rate INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'python-small-examples', 'author': 'jackzhenguo', 'meta_content': 'Python有趣的小例子一网打尽。Python基础、Python坑点、Python字符串和正则、Python绘图、Python日期和文件、Web开发、数据科学、机器学习、深度2.4k', 'fork': '102', 'origin_url': URL('https://github.com/jackzhenguo/python-small-examples')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'system-design-primer', 'author': 'donnemartin', 'meta_content': 'Learn how to design large-scale systems. Prep for the system design interview. Includes Anki flashcards.', 'star': '83.2k', 'fork': '4.4k', 'origin_url': URL('https://github.com/donnemartin/system-design-primer')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'awesome-scalability', 'author': 'binhnguyennus', 'meta_content': 'The Patterns of Scalable, Reliable, and Performant Large-Scale Systemshttp://awesome-scalability.com/', 'star': '24.5k', 'fork': '1.4k', 'origin_url': URL('https://github.com/binhnguyennus/awesome-scalability')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'gdb-frontend', 'author': 'rohanrhu', 'meta_content': '☕ GDBFrontend is an easy, flexible and extensionable gui debugger.https://oguzhaneroglu.com/projects/gd…', 'star': '716', 'fork': '14', 'origin_url': URL('https://github.com/rohanrhu/gdb-frontend')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'Complete-Python-3-Bootcamp', 'author': 'Pierian-Data', 'meta_content': 'Course Files for Complete Python 3 Bootcamp Course on Udemy', 'star': '8.1k', 'fork': '1.8k', 'origin_url': URL('https://github.com/Pierian-Data/Complete-Python-3-Bootcamp')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'leon', 'author': 'leon-ai', 'meta_content': '\U0001f9e0 Leon is your open-source personal assistant.https://getleon.ai', 'star': '6.3k', 'fork': '147', 'origin_url': URL('https://github.com/leon-ai/leon')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'esbuild', 'author': 'evanw', 'meta_content': 'An extremely fast JavaScript bundler and minifier', 'star': '2.3k', 'fork': '38', 'origin_url': URL('https://github.com/evanw/esbuild')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'wearable-microphone-jamming', 'author': 'y-x-c', 'meta_content': 'Repository for our paper Wearable Microphone Jamminghttp://sandlab.cs.uchicago.edu/jammer/', 'star': '138', 'fork': '10', 'origin_url': URL('https://github.com/y-x-c/wearable-microphone-jamming')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'efcore', 'author': 'dotnet', 'meta_content': 'EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.https://docs.microsoft.com/ef/core/', 'star': '8.7k', 'fork': '965', 'origin_url': URL('https://github.com/dotnet/efcore')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'playwright', 'author': 'microsoft', 'meta_content': 'Node library to automate Chromium, Firefox and WebKit with a single APIhttps://www.npmjs.com/package/playwright', 'star': '9k', 'fork': '92', 'origin_url': URL('https://github.com/microsoft/playwright')} INFO:GithubAnt:Get 18 Request in total with 10/60s rate INFO:GithubAnt:Get 17 Response in total with 10/60s rate INFO:GithubAnt:Get 16 dict in total with 10/60s rate INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'degoogle', 'author': 'tycrek', 'meta_content': 'A huge list of alternatives to Google products. Privacy tips, tricks, and links.https://degoogle.jmoore.dev', 'star': '2k', 'fork': '50', 'origin_url': URL('https://github.com/tycrek/degoogle')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'sherlock', 'author': 'sherlock-project', 'meta_content': '🔎 Hunt down social media accounts by username across social networkshttp://sherlock-project.github.io', 'star': '10.4k', 'fork': '207', 'origin_url': URL('https://github.com/sherlock-project/sherlock')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'the-art-of-command-line', 'author': 'jlevy', 'meta_content': 'Master the command line, in one page', 'star': '68.9k', 'fork': '2.2k', 'origin_url': URL('https://github.com/jlevy/the-art-of-command-line')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'freespeech', 'author': 'Merkie', 'meta_content': 'A free program designed to help the non-verbal.', 'star': '168', 'fork': '20', 'origin_url': URL('https://github.com/Merkie/freespeech')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'awesome-pentest', 'author': 'enaqx', 'meta_content': 'A collection of awesome penetration testing resources, tools and other shiny things', 'star': '11.4k', 'fork': '1k', 'origin_url': URL('https://github.com/enaqx/awesome-pentest')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'trax', 'author': 'google', 'meta_content': 'Trax — your path to advanced deep learning', 'star': '2.7k', 'fork': '90', 'origin_url': URL('https://github.com/google/trax')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'introtodeeplearning', 'author': 'aamini', 'meta_content': 'Lab Materials for MIT 6.S191: Introduction to Deep Learning', 'star': '1.6k', 'fork': '116', 'origin_url': URL('https://github.com/aamini/introtodeeplearning')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'CleanArchitecture', 'author': 'ardalis', 'meta_content': 'A starting point for Clean Architecture with ASP.NET Core', 'star': '3.8k', 'fork': '300', 'origin_url': URL('https://github.com/ardalis/CleanArchitecture')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': '3y', 'author': 'ZhongFuCheng3y', 'meta_content': '📓从Java基础、JavaWeb基础到常用的框架再到面试题都有完整的教程,几乎涵盖了Java后端必备的知识点', 'star': '5.1k', 'fork': '285', 'origin_url': URL('https://github.com/ZhongFuCheng3y/3y')} INFO:GithubAnt:Closed INFO:GithubAnt:Get 26 Request in total INFO:GithubAnt:Get 26 Response in total INFO:GithubAnt:Get 25 dict in total INFO:GithubAnt:Run GithubAnt in 180.234251 seconds
We use dict to store one item in examples, actually it support many way: dict, normal class, atrrs's class, data class and ORM class, it depend on your need and choice.
You can get some example in "./examples"
- Complex exception handle
one coroutine's exception will break await chain especially in a loop, unless we handle it by hand. eg:
for cor in self.as_completed((self.crawl(url) for url in self.urls)): try: await cor except Exception: # may raise many exception in a await chain pass
but we can use "self.as_completed_with_async" now, eg:
async fo result in self.as_completed_with_async( self.crawl(url) for url in self.urls, raise_exception=False): # exception in "self.crawl(url)" will be passed and logged automatic self.handle(result)
- High memory usage
It`s a "feature" that asyncio eat large memory especially with high concurrent IO, we can set a concurrent limit("connection_limit" or "concurrent_limit") simply, but it`s complex to get the balance between performance and limit.
Follow "Flake8", Format by "Black", typing check by "MyPy", sea Makefile for more detail.
[*] Log system [*] Nest item extractor [ ] Docs