1. Datasets (.csv or .xlsx files) and NumPy arrays (.npy files)
***********************************************************
***********************************************************
(1) Data_processing.ipynb (Path: Datasets)
***********************************************************
(a) Data_SinaWeibo.csv: The initial dataset obtained from the Python script
(b) UserData.csv - Section 2.7 - The initial user profile dataset after preprocessing (only users with profiles)
(c) UserProfile.csv - Section 2.7 - The user profile dataset with the necessary features (based on UserData; only users with profiles)
(d) ForCheck.csv - Section 3.3.2 - The dataset with "@" in F_Comment
(e) ManuCode.xlsx - Section 3.5.1 - The dataset used to manually decode emoji
(f) ManuCode_Finish.xlsx - Section 3.5.1 - The dataset completed with manual decoding
(g) UserInteraction.csv - Section 4.1.1 - The initial user interaction dataset (all users)
(h) UserFeature_Graph.csv - Section 4.1.4 (2) - The final user profile dataset (all users)
(i) UserInter_Graph.csv - Section 4.1.5 (4) - The user interaction dataset (all users; only includes username and user node index)
(j) Sina_allInfo.csv - Section 4.2.1 - The dataset with all interaction information
(k) UserPost.csv - Section 4.2.3 (2) - The dataset of user-comment correspondences
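A quick way to confirm that a local copy contains every file in list (1) is sketched below. The file names and the Datasets directory are taken from the list above; the function name missing_datasets and the repo_root argument are hypothetical and only serve the illustration.

```python
from pathlib import Path

# File names copied from list (1); "Datasets" is the path given in its heading.
DATASET_FILES = [
    "Data_SinaWeibo.csv",
    "UserData.csv",
    "UserProfile.csv",
    "ForCheck.csv",
    "ManuCode.xlsx",
    "ManuCode_Finish.xlsx",
    "UserInteraction.csv",
    "UserFeature_Graph.csv",
    "UserInter_Graph.csv",
    "Sina_allInfo.csv",
    "UserPost.csv",
]


def missing_datasets(repo_root="."):
    """Return the listed file names that are absent from <repo_root>/Datasets.

    Hypothetical helper: repo_root is assumed to be the directory that
    contains the Datasets folder.
    """
    data_dir = Path(repo_root) / "Datasets"
    return [name for name in DATASET_FILES if not (data_dir / name).is_file()]
```

Calling missing_datasets(".") from the repository root returns an empty list when all eleven files are present, and otherwise the names that still need to be generated by Data_processing.ipynb.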
***********************************************************
***********************************************************
(2) K-Prototype.ipynb: exp1 (Path: Train_record/KPrototype)
***********************************************************
(a) kp_user.csv - Section 1.1 - The original user features for K-Prototype
(b) kp_profile.csv - Section 1.1 - The standardized user features (based on kp_user.csv)
(c) kp_cost.npy - Section 1.2 - The K-Prototype cost for each K (K \in [1, 12])
(d) kp_centroid.npy - Section 1.3 - Cluster centroids for the optimal K (K=5)
(e) kp_label.npy - Section 1.3 - Cluster labels for the optimal K (K=5)
(f) kp_tsne.csv - Section 1.4 - t-SNE results for the optimal K (K=5)
***********************************************************
***********************************************************
(3) Ablation.ipynb: exp2 (Path: Train_record/Ablation/Result)
i: Hyperparameter setting i (i \in [1,6])
j: Number of clusters j (j \in [5, 6])
k: Label index with K=6 (k \in [0, 5])
------------------------------------------------------------
(a) repre$_i$.npy - Section 3.1 - User node representations from the converged model (HomoG)
(b) m$_i$Known_costs.npy - Section 3.2.$i$ - K-means costs of users with known profiles (2803 in total)
(c) m$_i$Known_centroid$_j$.npy - Section 3.2.$i$ - Cluster centroids for K = j
(d) m$_i$Known_label$_j$.npy - Section 3.2.$i$ - Cluster labels for K = j
(e) m$_i$Known_tsne$_j$.csv - Section 3.2.$i$ - t-SNE results for K = j
(f) m4_l$k$.csv - Section 4.3.$k$ - The user-comment dataset with Label $k$ for K = 6, Model 4 (M2: \epsilon^2 = 0.5)
***********************************************************
***********************************************************
(4) Model.ipynb: exp3 (Path: Train_record/Model)
i: Hyperparameter setting i (i \in [1,6])
------------------------------------------------------------
(a) Model$i$/All_loss.npy: Training loss of HG-PD
***********************************************************
***********************************************************
(5) Analysis_visualize.ipynb: exp3 (Path: Train_record/Model/Result)
i: Hyperparameter setting i (i \in [1,6])
j: Number of clusters j (j \in [5, 6])
k: Label index with K=6 (k \in [0, 5])
------------------------------------------------------------
(a) repre$_i$.npy - Section 1.1 - User node representations from the converged model (HG-PD)
(b) m$_i$Known_costs.npy - Section 1.2.$i$ - K-means costs of users with known profiles (2803 in total)
(c) m$_i$Known_centroid$_j$.npy - Section 1.2.$i$ - Cluster centroids for K = j
(d) m$_i$Known_label$_j$.npy - Section 1.2.$i$ - Cluster labels for K = j
(e) m$_i$Known_tsne$_j$.csv - Section 1.2.$i$ - t-SNE results for K = j
(f) m$_i$_profile.csv - Section 3 - The user feature dataset with labels (K=6)
(g) m4_l$k$.csv - Section 6.$k$ - The user-comment dataset with Label $k$ for K = 6, Model 4 (M2: \epsilon^2 = 0.5)
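The placeholder naming used in lists (3) and (5) can be expanded into concrete file names as sketched below. This assumes that each $_i$, $_j$ and $k$ placeholder is replaced by the bare integer (for example m1Known_centroid5.npy and m4_l0.csv); that substitution is a guess from the notation and should be checked against the actual directory contents. The function names are hypothetical.

```python
SETTINGS = range(1, 7)  # i: hyperparameter setting, i in [1, 6]
CLUSTERS = (5, 6)       # j: number of clusters, j in [5, 6]
LABELS = range(6)       # k: label index with K=6, k in [0, 5]


def per_setting_files(i):
    """File names (a)-(e) shared by lists (3) and (5) for one setting i.

    Assumption: placeholders are replaced by plain integers with no separator.
    """
    names = [f"repre{i}.npy", f"m{i}Known_costs.npy"]
    for j in CLUSTERS:
        names += [
            f"m{i}Known_centroid{j}.npy",
            f"m{i}Known_label{j}.npy",
            f"m{i}Known_tsne{j}.csv",
        ]
    return names


def label_files():
    """User-comment files for Model 4, one per label index k."""
    return [f"m4_l{k}.csv" for k in LABELS]
```

Under this reading, each of the six settings contributes eight files per result directory, and Model 4 adds six per-label files.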
***********************************************************
2. Tensor (.pt file)
***********************************************************
(1) Data_processing.ipynb (Path: Tensor)
***********************************************************
(a) user_f - Section 4.1.4 (2) - User features for all users (obtained from UserFeature_Graph)
(b) user_embed_idx - Section 4.1.4 (4) - User embedding indices (only users without profiles)
(c) userInter_data - Section 4.1.5 (6) - User interactions (identified by user node indices)
(d) userPost_comm - Section 4.2.3 (4) - User-comment correspondences (identified by user and comment node indices)
(e) Comment_Vector - Section 4.2.4 (2) - Text vector representation for all comments (obtained from bert-base-chinese)
(f) userInter_self - Section 5.1.1 - User interactions with self-loops (identified by user node indices; obtained from userInter_data)
(g) HeteroData - Section 5.1.2 - HeteroG
(h) s_user_f - Section 5.2 - Standardization of numeric variables in user_f (all users; obtained from user_f)
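The four file types described in this document can be read back with a small dispatcher such as the sketch below. It assumes the files were written with the usual pandas, NumPy and PyTorch save routines, and that the tensors in Section 2 are stored under their listed names with a .pt suffix (for example user_f.pt); neither point is confirmed here. The names loader_name and load_artifact are hypothetical.

```python
from pathlib import Path

# Suffix -> loader, following the file types named in the two section headings.
LOADER_BY_SUFFIX = {
    ".csv": "pandas.read_csv",
    ".xlsx": "pandas.read_excel",
    ".npy": "numpy.load",
    ".pt": "torch.load",
}


def loader_name(path):
    """Return the name of the loader that matches the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in LOADER_BY_SUFFIX:
        raise ValueError(f"No loader known for {path!r}")
    return LOADER_BY_SUFFIX[suffix]


def load_artifact(path):
    """Load one file; third-party libraries are imported only when needed."""
    name = loader_name(path)
    if name == "pandas.read_csv":
        import pandas as pd
        return pd.read_csv(path)
    if name == "pandas.read_excel":
        import pandas as pd  # reading .xlsx also needs an Excel engine such as openpyxl
        return pd.read_excel(path)
    if name == "numpy.load":
        import numpy as np
        return np.load(path)
    import torch
    # Assumption: plain tensors load with the defaults. Pickled objects such as
    # HeteroData may need weights_only=False on recent PyTorch versions; use
    # that only for files you trust.
    return torch.load(path)
```

For example, load_artifact("Train_record/KPrototype/kp_cost.npy") would return the K-Prototype cost array, provided the file exists at that path.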