StreamDR is a novel multi-level parallelization dimensionality reduction method for streaming high-dimensional data visualization, which consists of three modules: new data embedding, embedding function updating and embedding updating. In this approach, the above modules were re-designed to support better parallelization, thus alleviating the latency caused by module waits in the serial setup. Subsequently, a series of new designs and improvements for the three modules were proposed, such as a fast and stable incremental embedding function updating method and an efficient hybrid embedding updating strategy, to achieve fast, high-quality and temporal-stable streaming high-dimensional data visualization.
This project was based on python 3.7 and pytorch 1.7.0
. See requirements.txt
for all prerequisites, and you can also install them using the following command.
pip install -r requirements.txt
Instances | Dimensionality | Categories | Data Type | Link | |
---|---|---|---|---|---|
AReM | 42,239 | 6 | 7 | tabular | UCI |
Basketball | 52,913 | 6 | 5 | tabular | UCI |
Shuttle | 43,500 | 9 | 6 | tabular | UCI |
HAR | 10,299 | 561 | 7 | text | UCI |
MNIST | 60,000 | 784 | 10 | image | Paper |
All the datasets are supported with H5 format (e.g. HAR.h5), and we need all the dataset to be stored at data/data_files
.
To simulate the collected data set as a streaming data set with various data change pattern, try the following command:
python simulate_streaming_data.py --datasets HAR --change_modes PD
The configuration files can be found under the folder ./configs
, and we provide two config files with the format .yaml
. We give the guidance of several key parameters in this paper below.
- n_neighbors(k): It determines the granularity of the local structure to be maintained in low-dimensional space. A too small value will cause one cluster in the high-dimensional space be projected into two low-dimensional clusters, while too large value will aggravate the problem of clustering overlap. The default setting is k = 15.
- preserve_weight(α): It determines the degree of preserving the embedding quality of the existing data during the updating of the embedding function. The default setting is α = 10.
- candidate_coefficient(β): It determines how much range of data will be selected as candidate neighbors when approximately computing kNN. The default setting is β = 10.
- global_pattern_change_thresh(Γ): It determines how many out-of-distribution data are detected to consider a change in the global pattern of the data. The default setting is Γ = 50.
- initializing_epochs: It determines the number of iterations on the pre-existing set when initializing the embedding function. The default setting is 200.
- updating_epochs: It determines the number of iterations on the streaming set when updating the embedding function. The default setting is 50.
To launch a streaming data processing with StreamDR, check the configuration file configs/StreamDR.yaml
, and try the following command:
python stream_main.py --configs configs/StreamDR.yaml
The following is a comparison of the results of different dimensionality reduction methods for streaming high-dimensional data on HAR dataset.