GitHub - prajwalkk/Search-Engine-FS: End-to-end search engine built by crawling the pages in the uic.edu domain. Uses techniques such as query expansion and PageRank.

Instructions to run the app

If the files fail to load or cannot be used, download them from GitHub: https://github.com/prajwalkk/SearchEngineFS

Use the master branch to run it locally.

command: git clone https://github.com/prajwalkk/SearchEngineFS.git

app deployed online: https://uic-search-pkk.herokuapp.com/

Requires Python version 3.7 or above. Recommended: Python 3.8.

This was tested on Linux systems. I do not recommend Windows, as I do not own a Windows machine and could not test on it. The python command may differ on your system: it could be python3, python3.8 or python. My system uses python3.8 as the command.
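Because the interpreter name varies between systems, it can help to confirm that whichever command you use meets the 3.7 minimum. The snippet below is a minimal sketch; the function name `python_is_supported` is hypothetical and not part of this repository.

```python
import sys

# Minimum version stated in these instructions.
MIN_VERSION = (3, 7)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    status = "OK" if python_is_supported() else "too old"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Save it under any name and run it with each candidate command (python, python3, python3.8) to see which one is suitable.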

Upgrade pip. Windows PowerShell:

py -m pip --version
py -m pip install --upgrade pip

Linux and macOS:

python3 -m pip install --user --upgrade pip
python3 -m pip --version

Install virtualenv, then create a virtual environment and activate it. Windows:

py -m pip install --user virtualenv
py -m venv env
.\env\Scripts\activate

Linux:

python3 -m pip install --user virtualenv
python3 -m venv env
source env/bin/activate
python -m pip install --upgrade pip  (this upgrades the pip of the virtualenv; --user is omitted because pip rejects user installs inside an active virtualenv)

Install the required dependencies

pip install -r requirements.txt

** If a wheel error occurs, run:

pip install wheel

Place the files in the search_engine folder (the one that contains manage.py), then run:

python manage.py runserver

Wait until the console message says the server has started (the same message says to press Ctrl+C to exit). Then go to the address shown in the terminal, normally localhost:8000.

Files needed:

Crawling - independent module, located in the Crawler folder. Do not run it from inside the Crawler folder; run it from the folder that contains manage.py: python Crawler/main.py. It saves the data in DataFiles/Crawled and DataFiles/Links, in a folder named with the current date.
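To make the output layout concrete, here is a minimal sketch of how the date-stamped folders could be derived. It assumes the folder name uses the YYYYMMDD format, inferred from the 20200510 example in this README; the function name `crawl_output_dirs` is hypothetical and is not taken from the crawler's code.

```python
from datetime import date
from pathlib import Path


def crawl_output_dirs(base="DataFiles", day=None):
    """Return the (Crawled, Links) folders for one crawl date.

    Assumption: folders are named YYYYMMDD, as in the 20200510 example.
    """
    day = day or date.today()
    stamp = day.strftime("%Y%m%d")
    return Path(base) / "Crawled" / stamp, Path(base) / "Links" / stamp


if __name__ == "__main__":
    crawled, links = crawl_output_dirs()
    print(crawled)
    print(links)
```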

PageRanker.py - performs the PageRank computation. To use the latest crawled data, open the Python file and change every occurrence of the date (e.g. 20200510) to the date on which the crawler was run.
vectorizer_pipeline.py - does all the preprocessing. To use the latest crawled data, open the Python file and change every occurrence of the date (e.g. 20200510) to the date on which the crawler was run.
analyse_query.py - controller component that performs the calculations. To use the latest crawled data, open the Python file and change every occurrence of the date (e.g. 20200510) to the date on which the crawler was run.
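For readers unfamiliar with PageRank, the following is a generic power-iteration sketch of the algorithm. It is not the code in PageRanker.py; the graph format (a dict mapping each page to the pages it links to), the damping factor of 0.85 and the fixed iteration count are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Generic PageRank by power iteration.

    graph: dict mapping a node to the list of nodes it links to.
    damping and iterations are illustrative defaults, not project settings.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    links = {"a": ["c"], "b": ["c"], "c": []}
    for page, score in sorted(pagerank(links).items()):
        print(page, round(score, 4))
```

In the small example, page c receives links from both a and b, so it ends up with the highest score.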

Basic app execution flow:

Crawl using python Crawler/main.py
Change the dates in PageRanker.py, analyse_query.py and vectorizer_pipeline.py to the latest crawl date, wherever a date appears. E.g. 20200510 becomes 20200511 (if the crawler was run on that day).
Run the indexer files:

python vectorizer_pipeline.py
python analyse_query.py

Run the app server: python manage.py runserver
In a browser, open http://127.0.0.1:8000/ (or whichever port the console specifies)
Profit
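Since the date has to be changed by hand in several files, a small helper can at least tell you which date to use. The sketch below assumes crawl folders sit directly under DataFiles/Crawled and are named YYYYMMDD (inferred from the 20200510 example); `latest_crawl_date` is a hypothetical name, not a function in this repository.

```python
import re
from pathlib import Path

# Assumption: crawl folders are named with eight digits, e.g. 20200510.
DATE_DIR = re.compile(r"^\d{8}$")


def latest_crawl_date(crawled_root="DataFiles/Crawled"):
    """Return the newest YYYYMMDD folder name, or None if there is none."""
    root = Path(crawled_root)
    if not root.is_dir():
        return None
    stamps = [
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and DATE_DIR.match(entry.name)
    ]
    # YYYYMMDD strings sort chronologically when compared as text.
    return max(stamps, default=None)


if __name__ == "__main__":
    print(latest_crawl_date())
```

Run it from the folder that contains manage.py, then use the printed date when editing the three files.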

This app will not run if the DataFiles folder is empty. The following files need to be in the DataFiles directory:

File	Job
`CrawledData\`	this has all the crawled pages
`Links\`	this has all the graph data
`dataFrame_bk.pkl`	this is the dataframe to persist the values
`page_rank.pkl`	pagerank file
`tfidf.joblib`	TFIDF matrix
`vectorizer.joblib`	Inverted Index
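Before starting the server, the presence of these entries can be checked with a short script. This is a minimal sketch: the names are copied from the table above, while `missing_data_files` is a hypothetical helper and the check only tests existence, not whether the files are valid.

```python
from pathlib import Path

# Entries listed in the table above.
REQUIRED_ENTRIES = [
    "CrawledData",
    "Links",
    "dataFrame_bk.pkl",
    "page_rank.pkl",
    "tfidf.joblib",
    "vectorizer.joblib",
]


def missing_data_files(data_dir="DataFiles"):
    """Return the required entries that do not exist under data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing from DataFiles:", ", ".join(missing))
    else:
        print("All required data files are present.")
```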

