akalongman/geo-words


Georgian (ka_GE) word list

Download in: DIC | TXT | SQL

Data sources

Crawler

The crawler is written in PHP and uses MySQL as its database. The code is in the crawler folder.

Before running the script, configure the database and run the migrations.

First, rename the file .env.example to .env and specify the database credentials.

Install composer dependencies:

composer install

Then run the migrations:

composer migrate  

Usage

Crawl links with internal profile

This command crawls only URLs inside the specified domain and ignores external URLs.

php cmd crawl --project-name="My Project" --profile=internal "http://www.nplg.gov.ge/gwdict/index.php"
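The filtering rule behind the internal profile can be pictured with a short sketch. It is written in Python for illustration only and is not the crawler's PHP code; the function name `is_internal` and the choice to compare host names exactly are assumptions.

```python
from urllib.parse import urlsplit


def is_internal(start_url: str, link: str) -> bool:
    """Return True when link is on the same host as start_url.

    Assumption: "inside the specified domain" means an exact host match.
    The real crawler may compare domains differently.
    """
    start_host = urlsplit(start_url).hostname
    link_host = urlsplit(link).hostname
    return start_host is not None and start_host == link_host


start = "http://www.nplg.gov.ge/gwdict/index.php"
print(is_internal(start, "http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1"))
print(is_internal(start, "http://example.com/page"))
```

Under this assumption, a link to a different host is skipped even when it shares the same top-level domain.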

Crawl links with all profile

This command crawls all links, including external ones.

php cmd crawl --project-name="My Project" --profile=all "http://www.nplg.gov.ge/gwdict/index.php"

Crawl links with domain profile

This command crawls links on any domain that ends with the value of --domain.

php cmd crawl --project-name="My Project" --profile=domain --domain=.ge "http://www.nplg.gov.ge/gwdict/index.php"

This crawls links whose URL domain ends with the .ge suffix.
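A minimal sketch of a suffix check like the one described for the domain profile. It is illustrative Python, not the crawler's PHP code; the function name `matches_domain` and the case-insensitive comparison are assumptions.

```python
from urllib.parse import urlsplit


def matches_domain(link: str, suffix: str) -> bool:
    """Return True when the host part of link ends with suffix.

    Only the host is tested, so a suffix that appears in the path
    or the query string does not count as a match.
    """
    host = urlsplit(link).hostname or ""  # hostname is already lower-cased
    return host.endswith(suffix.lower())


print(matches_domain("http://www.nplg.gov.ge/gwdict/index.php", ".ge"))
print(matches_domain("http://example.com/page.ge", ".ge"))
```

Testing the parsed host rather than the whole URL string keeps a path such as /page.ge from being mistaken for a .ge domain.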

Crawl links with subset profile

This command crawls every URL that starts with the value of --subset.

php cmd crawl --project-name="My Project" --profile=subset --subset="http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1" "http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1"

This crawls links whose URL starts with the http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1 prefix.
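The subset rule reads like a plain string-prefix test, which can be sketched as follows. This is illustrative Python, not the crawler's PHP code; the function name `in_subset` and the extra query parameters in the sample links are hypothetical.

```python
def in_subset(link: str, prefix: str) -> bool:
    """Return True when link begins with prefix (plain string comparison).

    Assumption: no URL normalization happens before the comparison.
    """
    return link.startswith(prefix)


prefix = "http://www.nplg.gov.ge/gwdict/index.php?a=list&d=1"

# Hypothetical links, used only to exercise the check.
print(in_subset(prefix + "&p=2", prefix))
print(in_subset("http://www.nplg.gov.ge/gwdict/index.php?a=list&d=2", prefix))
print(in_subset("http://www.nplg.gov.ge/gwdict/index.php?a=list&d=10", prefix))
```

If the comparison really is a plain prefix test, a link ending in d=10 also matches a prefix ending in d=1, so the prefix may need to be chosen with that in mind.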

Continue project

You can resume a stopped project with this command:

php cmd crawl --project-id={id}

To show all available options, run: php cmd help crawl

TODO

  • Fix wrong entries and add more words
  • Add tests
  • Add notification sending on complete

License

Please see the LICENSE included in this repository for a full copy of the MIT license, which this project is licensed under.