This is the source code and prompts for the paper "CN-PDS: China’s Public Data Search Enabled by LLMs".
Prompts for metadata consolidation, metadata enrichment, dataset reranking, and relevance explanation are in ./prompts/.
./cnpds-frontend-vue, based on Vue3 and Tabler (an HTML Dashboard UI Kit built on Bootstrap).
Deploy the frontend using nginx inside a Docker container.
Expose: Port 80. Env-Vars: VITE_BACKEND_HOST; VITE_BEIAN_NUMBER.
Image build command:
# PWD: ./ChinaOpenDataPortal-Frontend-Vue
docker build -f docker/Dockerfile -t username/imagename .
Push image to Docker Hub:
docker push username/imagename:latest
Use this command to create a basic env.custom.sh:
echo "PYTHON_PATH=$(realpath $(which python))" >> ./scripts/env.custom.sh
./cnpds-backend-java, based on Spring Boot and Thymeleaf, which can also act as an API server for other frontend services.
The default index path is indices/current.
Use scripts/start-server.java.sh to start a process that serves as both API server and backend.
Append more arguments as needed, for example: ./scripts/start-server.java.sh --server.port=9998
Environment variables you may want to specify in env.custom.sh:
- MAVEN_PATH
- JAVA_PATH (Java 11 Recommended)
- ADMIN_USER
- ADMIN_PSWD
- PYTHON_RETRY_TIMES
./cnpds-backend-flask, based on Flask, which communicates with Large Language Models and queries the MySQL database.
Use scripts/start-server.py.sh to start the Python backend process.
Append more arguments as needed, for example: ./scripts/start-server.py.sh --host 0.0.0.0
Environment variables you may want to specify in env.custom.sh:
- FLASK_PATH
- FLASK_PORT
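As an illustration of the two backend variable lists above, a scripts/env.custom.sh fragment could look like the sketch below. It follows the NAME=value form used by the PYTHON_PATH command earlier. Every value is a hypothetical placeholder, and whether the *_PATH variables expect a binary or a directory is an assumption made by analogy with PYTHON_PATH.

```shell
# Hypothetical scripts/env.custom.sh fragment; all values are placeholders.
# Assumption: *_PATH variables point at executables, as PYTHON_PATH does.
MAVEN_PATH=/usr/bin/mvn
JAVA_PATH=/usr/lib/jvm/java-11-openjdk/bin/java
ADMIN_USER=admin
ADMIN_PSWD=change-me
PYTHON_RETRY_TIMES=3
FLASK_PATH=/usr/local/bin/flask
FLASK_PORT=5000
```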
./cnpds-metadata: Python scripts that crawl metadata from each portal, with multi-threading support. Crawled metadata is written to the database for use in the next step.
./cnpds-index-builder: builds the index from the metadata written to the database.
Use scripts/fetch-data.sh to start a metadata fetching process.
PS: There is one table that serves as an archive table for each metadata crawl and another table for index building.
Environment variables you may want to specify in env.custom.sh:
- Crawler control:
  - CRAWL_WORKERS
  - CRAWL_FILES (whether or not to download data files)
- Database (only MySQL is supported):
  - DB_ADDR
  - DB_PORT
  - DB_USER
  - DB_PSWD
  - DATABASE_NAME
  - REF_TABLE_NAME (specifies a table to use as a template)
  - PRD_TABLE_NAME (specifies the table used in production)
- Others:
  - PYTHON_PATH (Python 3.6 recommended)
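Since the crawler depends on several database variables, it may help to check that they are set before starting a crawl. The following is a minimal sketch of such a check. The function name check_required_vars is hypothetical and is not part of the repository's scripts; only the variable names come from the list above.

```shell
# Sketch: report which of the named variables are unset or empty.
# check_required_vars is a hypothetical helper, not a script from this repository.
check_required_vars() {
    missing=""
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}
```

A possible use, after sourcing scripts/env.custom.sh, is check_required_vars DB_ADDR DB_PORT DB_USER DB_PSWD DATABASE_NAME REF_TABLE_NAME PRD_TABLE_NAME, running scripts/fetch-data.sh only if it succeeds.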
After the metadata is written to the database, the index builder starts.
If necessary, it links the latest index to the path indices/current.
If the current index has been updated, the server receives a POST request and refreshes its index path.
Environment variables you may want to specify in env.custom.sh:
- BACKEND_URL
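The linking step described above can be pictured as repointing a symbolic link. This is a sketch under assumptions, not the index builder's actual code: it supposes each build lives in its own directory under indices/, and link_current_index is a hypothetical name. The POST request to the server is not shown because its endpoint is not specified here.

```shell
# Sketch: repoint <indices_dir>/current at the newest index directory.
# Assumption: each index build is a separate directory under indices_dir.
link_current_index() {
    indices_dir="$1"   # for example ./indices
    latest_name="$2"   # directory name of the newest build, relative to indices_dir
    # -s: symbolic link, -f: replace an existing link,
    # -n: treat an existing "current" link as a file instead of following it
    ln -sfn "$latest_name" "$indices_dir/current"
}
```

Using a relative link target keeps the link valid if the whole indices directory is moved.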
+-------------------+--------------+------+-----+---------+----------------+
| Field             | Type         | Null | Key | Default | Extra          |
+-------------------+--------------+------+-----+---------+----------------+
| dataset_id        | int          | NO   | PRI | NULL    | auto_increment |
| title             | varchar(255) | YES  |     | NULL    |                |
| description       | text         | YES  |     | NULL    |                |
| tags              | text         | YES  |     | NULL    |                |
| department        | varchar(255) | YES  |     | NULL    |                |
| category          | varchar(255) | YES  |     | NULL    |                |
| publish_time      | varchar(255) | YES  |     | NULL    |                |
| update_time       | varchar(255) | YES  |     | NULL    |                |
| is_open           | varchar(255) | YES  |     | NULL    |                |
| data_volume       | varchar(255) | YES  |     | NULL    |                |
| industry          | varchar(255) | YES  |     | NULL    |                |
| update_frequency  | varchar(255) | YES  |     | NULL    |                |
| telephone         | varchar(255) | YES  |     | NULL    |                |
| email             | varchar(255) | YES  |     | NULL    |                |
| data_formats      | varchar(255) | YES  |     | NULL    |                |
| url               | text         | YES  |     | NULL    |                |
| province          | varchar(255) | YES  |     | NULL    |                |
| city              | varchar(255) | YES  |     | NULL    |                |
| standard_industry | varchar(255) | YES  |     | NULL    |                |
+-------------------+--------------+------+-----+---------+----------------+
+-------------------+--------------+------+-----+---------+----------------+
| Field             | Type         | Null | Key | Default | Extra          |
+-------------------+--------------+------+-----+---------+----------------+
| dataset_id        | int          | NO   | PRI | NULL    | auto_increment |
| title             | varchar(255) | YES  |     | NULL    |                |
| description       | text         | YES  |     | NULL    |                |
| tags              | text         | YES  |     | NULL    |                |
| publish_time      | varchar(255) | YES  |     | NULL    |                |
| update_time       | varchar(255) | YES  |     | NULL    |                |
| industry          | varchar(255) | YES  |     | NULL    |                |
| department        | varchar(255) | YES  |     | NULL    |                |
| is_open           | varchar(255) | YES  |     | NULL    |                |
| data_volume       | varchar(255) | YES  |     | NULL    |                |
| update_frequency  | varchar(255) | YES  |     | NULL    |                |
| data_formats      | varchar(255) | YES  |     | NULL    |                |
| url               | text         | NO   |     | NULL    |                |
| province          | varchar(255) | YES  |     | NULL    |                |
| city              | varchar(255) | YES  |     | NULL    |                |
| standard_industry | varchar(255) | YES  |     | NULL    |                |
| url_hash          | varchar(255) | NO   | UNI | NULL    |                |
| origin_metadata   | mediumtext   | YES  |     | NULL    |                |
| fetch_time        | varchar(255) | YES  |     | NULL    |                |
+-------------------+--------------+------+-----+---------+----------------+
+----------------------+------+------+-----+---------+-------+
| Field                | Type | Null | Key | Default | Extra |
+----------------------+------+------+-----+---------+-------+
| dataset_id           | int  | YES  |     | NULL    |       |
| description_enhanced | text | YES  |     | NULL    |       |
+----------------------+------+------+-----+---------+-------+
+------------+------+------+-----+---------+-------+
| Field      | Type | Null | Key | Default | Extra |
+------------+------+------+-----+---------+-------+
| dataset_id | int  | NO   |     | NULL    |       |
| path       | text | YES  |     | NULL    |       |
+------------+------+------+-----+---------+-------+
version: '3.4'
services:
  frontend:
    image: 'username/imagename:latest'
    container_name: codp-frontend-vue
    restart: unless-stopped
    ports:
      - 'HOST_PORT:80'
    environment:
      - 'VITE_BACKEND_HOST=VITE_BACKEND_HOST_PLACEHOLDER'
      - 'VITE_BEIAN_NUMBER=VITE_BEIAN_NUMBER_PLACEHOLDER'
Params:
- Image Name.
- Host Port.
- Env-Var: VITE_BACKEND_HOST.
- Env-Var: VITE_BEIAN_NUMBER.
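The four parameters above correspond to placeholders in the compose file, so they can be filled in with a small substitution step. The following is one possible sketch using sed. The function name render_compose and all example values are hypothetical, and the approach assumes the values contain no "|" or "&" characters, which sed would treat specially.

```shell
# Sketch: print a compose template with its placeholders replaced.
# render_compose is a hypothetical helper; values must not contain '|' or '&'.
render_compose() {
    template="$1"
    image="$2"
    host_port="$3"
    backend_host="$4"
    beian="$5"
    sed \
        -e "s|username/imagename:latest|$image|" \
        -e "s|HOST_PORT|$host_port|" \
        -e "s|VITE_BACKEND_HOST_PLACEHOLDER|$backend_host|" \
        -e "s|VITE_BEIAN_NUMBER_PLACEHOLDER|$beian|" \
        "$template"
}
```

For example, render_compose docker-compose.template.yml alice/cnpds-frontend:latest 8080 http://localhost:9998 ICP-0000 > docker-compose.yml would write a filled-in file; the file names and values here are placeholders.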
Recommended:
nohup bash ./scripts/start-server.java.sh >> ./logs/server.java.txt 2>&1 &
Recommended:
nohup bash ./scripts/start-server.py.sh >> ./logs/server.py.txt 2>&1 &
Recommended:
# the first of every month at 0:00
echo "0 0 1 */1 * __ROOT=\"`realpath .`\"; bash \${__ROOT}/scripts/fetch-data.sh > \"\${__ROOT}/logs/fd-\`date --i\`.txt\" 2>&1" >> logs/auto-task.txt
crontab logs/auto-task.txt
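The escaping in the crontab command above is dense, so here is a sketch that builds an equivalent entry in a more readable way. The function name build_cron_line is hypothetical. The schedule "0 0 1 * *" is equivalent to "0 0 1 */1 *", and the date option --iso-8601 assumes GNU date.

```shell
# Sketch: print a monthly crontab entry for the given project root.
# build_cron_line is a hypothetical helper, not a script from this repository.
build_cron_line() {
    root="$1"
    # Single quotes keep $(date ...) literal so that cron's shell expands it at run time.
    printf '0 0 1 * * bash "%s/scripts/fetch-data.sh" > "%s/logs/fd-$(date --iso-8601).txt" 2>&1\n' "$root" "$root"
}
```

Note that running crontab with a file argument replaces the user's whole crontab. To keep existing entries, a common idiom is (crontab -l 2>/dev/null; build_cron_line "$(realpath .)") | crontab -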