Skip to content

Search Engine with End-to-end. Based on crawling the pages at the uic.edu domain. Based on NLP techniques like Query expansion, PageRank.

Notifications You must be signed in to change notification settings

prajwalkk/Search-Engine-FS

Repository files navigation

Instructions to run the app

github location to dowload the files if not able load or use, is: https://github.com/prajwalkk/SearchEngineFS

use master branch to run it locally

command: git clone https://github.com/prajwalkk/SearchEngineFS.git

required python verson 3.7 and above. recommended - python3.8

This is tested in linux systems. I do not recommend windows as I do not own a windows machine. the python command may differ in your system. It could be python3 or python3.8 or python. My system has python3.8 as the command.

  1. Upgrade pip Windows powershell:
py -m pip --version
py -m pip install --upgrade pip
Linux and macOS:
python3 -m pip install --user --upgrade pip
python3 -m pip --version
  1. install virtualenv Create a virtual environment and activate it Windows:
py -m pip install --user virtualenv
py -m venv env
.\env\Scripts\activate
Linux:
python3 -m pip install --user virtualenv
python3 -m venv env
source env/bin/activate
python -m pip install --user --upgrade pip  (this upgraded the pip of the virtualenv)
  1. Install the required dependencies
pip install -r requirements.txt

** if any wheel error comes, do:

pip install wheel

  1. place the files in the search_engine folder (the one which contains manage.py) Run

python manage.py runserver

Wait for sometime until the message says the server is created and press Ctrl+C to exit. now go to the address specified in the terminal. normally localhost:8000

Files needed:

Crawling - Independent module. Present in the Crawler folder. To run use, Do not run it in the Crawler folder. Run it in the manage.py folder python Crawler/main.py It save the data in DataFiles/Crawled and DataFiles/Links with the current date as the folder.

  • PageRanker.py - this does the pagerank computation. To use the latest file, just open the python file and change the date eg 20200510 to the date on which the crawler was run (everywhere)
  • Vectorizer_pipeline.py - does all preprocessing. To use the latest file, just open the python file and change the date eg 20200510 to the date on which the crawler was run (everywhere)
  • analyse_query.py controller component to calculate. To use the latest file, just open the python file and change the date eg 20200510 to the date on which the crawler was run (everywhere)

Basic app execution flow:

  1. Crawl using python Crawler/main.py

  2. Change the dates in the PageRanker.py, analyse_query.py, vectorizer_pipeline.py to the latest crawled date wherever a date exists. Eg 20200510 becomes 20200511 (if run)

  3. Run the indexer files:

python vectorizer_pipeline.py
python analyse_query.py
  1. Run the app server. python manage.py runserver
  2. In brower, open http://127.0.0.1:8000/ (or any other port as specified in the console)
  3. Profit

This app will not run if the DataFiles are empty. Following files need to be in DataFiles Dir:

File Job
CrawledData\ this has all the crawled pages
Links\ this has all the graph data
dataFrame_bk.pkl this is the dataframe to persist the values
page_rank.pkl pagerank file
tfidf.joblib TFIDF matrix
vectorizer.joblib Inverted Index

About

Search Engine with End-to-end. Based on crawling the pages at the uic.edu domain. Based on NLP techniques like Query expansion, PageRank.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published