warc-content

A simple WARC archive content browser. This tool takes WARC archives as input, indexes them, and serves a simple web page where you can browse the crawled URLs in a tree grid.
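
To illustrate the idea of browsing URLs as a tree, here is a minimal sketch of how a list of crawled URLs could be grouped by host and path segment. It is not the tool's actual implementation; the function names `build_url_tree` and `print_tree` and the sample URLs are assumptions made up for this example.

```python
from urllib.parse import urlsplit


def build_url_tree(urls):
    """Group URLs into a nested dict keyed by host, then by path segment."""
    tree = {}
    for url in urls:
        parts = urlsplit(url)
        # Start at the host, then descend one level per non-empty path segment.
        segments = [parts.netloc] + [s for s in parts.path.split("/") if s]
        node = tree
        for segment in segments:
            node = node.setdefault(segment, {})
    return tree


def print_tree(tree, depth=0):
    """Print the nested dict as an indented outline, sorted by name."""
    for name in sorted(tree):
        print("  " * depth + name)
        print_tree(tree[name], depth + 1)


if __name__ == "__main__":
    # Hypothetical sample URLs, for illustration only.
    sample = [
        "http://example.com/news/2012/article.html",
        "http://example.com/news/print/article.html",
        "http://example.com/calendar/2012/05",
    ]
    print_tree(build_url_tree(sample))
```

In a tree like this, branches such as `print` or `calendar` stand out as candidates to exclude from a crawl.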

I personally use this tool to locate useless links in crawled pages, for example calendars, print pages, and image generators.

This is how the webpage looks:

Usage

./warccontent.py ~/warcs/*.warc.gz

Wait until the data is indexed, then open http://localhost:8080/ in your browser.

features to add in the future

content size counter
regex tool to test against urls
multi-core support for gzipped archives
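
The regex tool is not implemented yet. As a rough sketch of what testing a pattern against URLs might look like, the example below splits a list of URLs into those that match a regular expression and those that do not. The function name `split_by_pattern`, the pattern, and the sample URLs are hypothetical.

```python
import re


def split_by_pattern(pattern, urls):
    """Return (matched, unmatched) lists of URLs for a regular expression."""
    compiled = re.compile(pattern)
    matched, unmatched = [], []
    for url in urls:
        # search() matches anywhere in the URL, not only at the start.
        if compiled.search(url):
            matched.append(url)
        else:
            unmatched.append(url)
    return matched, unmatched


if __name__ == "__main__":
    # Hypothetical sample URLs, for illustration only.
    urls = [
        "http://example.com/news/article.html",
        "http://example.com/print/article.html",
        "http://example.com/calendar/2012/05",
    ]
    matched, unmatched = split_by_pattern(r"/(print|calendar)/", urls)
    print("matched:", matched)
    print("unmatched:", unmatched)
```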

known issues

The warc-tools library does not handle large files within archives well. Large files can cause a MemoryError.

License

GPLv3

martinsbalodis/warc-content
