- Kevin Scannell (http://crubadan.org/languages/ka, CC-BY 4.0)
- National Parliamentary Library of Georgia (http://www.nplg.gov.ge/gwdict/index.php)
- Other Georgian eBooks/websites (Crawler)
The crawler is written in PHP and uses MySQL as its database. The code is located in the crawler folder.
Before running the script, configure the database and run the migrations. First, rename the file .env.example to .env and specify the database credentials.
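As a rough sketch, the database section of the .env file might look like the following (the key names here are assumptions; check .env.example for the actual ones used by this project):

```ini
; Hypothetical database credentials for the crawler
DB_HOST=127.0.0.1
DB_PORT=3306
DB_DATABASE=crawler
DB_USERNAME=root
DB_PASSWORD=secret
```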
Install the Composer dependencies:
composer install
Then run the migrations:
composer migrate
This command crawls URLs only inside the specified domain and ignores external URLs:
php cmd crawl --project-name="My Project" --profile=internal "http://www.nplg.gov.ge/gwdict/index.php"
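The internal profile boils down to a same-host check against the start URL. The crawler itself is PHP; the Python snippet below is only an illustrative sketch of the rule, not the project's actual code:

```python
from urllib.parse import urlparse

def is_internal(link: str, start_url: str) -> bool:
    """Sketch of the 'internal' profile: keep a link only if its
    host matches the host of the start URL."""
    return urlparse(link).netloc == urlparse(start_url).netloc

start = "http://www.nplg.gov.ge/gwdict/index.php"
print(is_internal("http://www.nplg.gov.ge/gwdict/index.php?a=list", start))  # True
print(is_internal("http://example.com/page", start))  # False
```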
This command crawls all links:
php cmd crawl --project-name="My Project" --profile=all "http://www.nplg.gov.ge/gwdict/index.php"
This command crawls links on any domain that ends with the --domain suffix:
php cmd crawl --project-name="My Project" --profile=domain --domain=.ge "http://www.nplg.gov.ge/gwdict/index.php"
Only links whose domain ends with the .ge suffix will be crawled.
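The domain profile is a simple suffix match on the link's host. Again, the project is PHP; this Python sketch only illustrates the matching rule:

```python
from urllib.parse import urlparse

def matches_domain(link: str, suffix: str) -> bool:
    """Sketch of the 'domain' profile: keep a link only if its
    host ends with the given suffix (e.g. '.ge')."""
    return urlparse(link).netloc.endswith(suffix)

print(matches_domain("http://www.nplg.gov.ge/gwdict/index.php", ".ge"))  # True
print(matches_domain("http://example.com/page", ".ge"))  # False
```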
This command crawls only URLs that start with the --subset prefix:
php cmd crawl --project-name="My Project" --profile=subset --subset="http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1" "http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1"
Only links whose URL starts with the prefix http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1 will be crawled.
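The subset profile reduces to a string-prefix check on the full URL. As with the sketches above, this Python snippet is illustrative only, not the crawler's PHP code:

```python
def matches_subset(link: str, prefix: str) -> bool:
    """Sketch of the 'subset' profile: keep a link only if the
    full URL starts with the given prefix."""
    return link.startswith(prefix)

prefix = "http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1"
print(matches_subset(prefix + "&t=2", prefix))  # True
print(matches_subset("http://www.nplg.gov.ge/other", prefix))  # False
```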
You can resume a stopped project with this command:
php cmd crawl --project-id={id}
To show all available options, run: php cmd help crawl
- Fix incorrect entries and add more words
- Add tests
- Add a notification on completion
Please see the LICENSE file included in this repository for a full copy of the MIT license, under which this project is released.