Modification of the KITTI-360 dataset for Scene Recognition (Data Association) tasks.

KITTI-360 conversion for Scene Recognition tasks

Scene Recognition is the problem of correctly associating a set of visible objects with objects marked on a semantic map; this problem is also sometimes called Data Association. Please note that this is different from the problem where the observed scene must be labeled with a category such as 'kitchen' or 'bedroom'.

Scene Recognition is used in various mobile robot localization problems, such as semantic mapping, global localization, kidnapping recovery, loop closing and so on. The problem is that there is no dedicated dataset for this task (at least I found none). Such a dataset should have the following features:

  • a semantic map with globally marked objects;
  • scene data (views from the robot's camera), where objects are also labeled and their IDs are correctly matched with the IDs on the Semantic map.

I made a script that converts the popular KITTI-360 dataset to this form. KITTI-360 has the same problem: the IDs of its 3D labels are not matched with the IDs of its 2D labels. My script does this matching. You are welcome to download the already prepared semantic maps and scene sets, or use my script to create your own with different settings.

Conversion process

  1. A Semantic map is created from the 3D boxes by determining the center of each object in global coordinates.
  2. For each camera pose, visible objects from the Semantic map are selected based on the camera's field of view and the maximum lidar distance.
  3. The 3D boxes of those objects are projected to bounding rectangles on the camera plane using the camera calibration parameters.
  4. The 2D semantic masks are matched with those rectangles by Hungarian optimization of the IoU score; this is how the global matching of IDs is obtained (a sketch of steps 3-4 is given after this list).
  5. Lidar points are projected to the image plane, the points falling on each object mask are selected, and the distances to the objects are calculated.
  6. Raw images are used to obtain CLIP features; first the image is cropped by the object's mask, and the area outside the mask is filled with grey.
  7. All data is exported in .csv and .yaml formats.
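
A minimal sketch of steps 3 and 4 is given below, assuming a 3x4 camera projection matrix and box corners already transformed into the camera frame; the helper names and shapes are illustrative and not the exact code of the extractor script:

```python
# Sketch of steps 3-4: project 3D box corners to the image plane and match the
# resulting rectangles to 2D mask rectangles by Hungarian optimization of IoU.
import numpy as np
from scipy.optimize import linear_sum_assignment

def project_box(corners_cam: np.ndarray, P_rect: np.ndarray) -> np.ndarray:
    """corners_cam: (8, 3) box corners in the camera frame; P_rect: (3, 4) projection matrix."""
    pts_h = np.hstack([corners_cam, np.ones((8, 1))])   # homogeneous coordinates, (8, 4)
    uv = (P_rect @ pts_h.T).T                            # image-plane points, (8, 3)
    uv = uv[:, :2] / uv[:, 2:3]                          # perspective division
    x1, y1 = uv.min(axis=0)
    x2, y2 = uv.max(axis=0)
    return np.array([x1, y1, x2, y2])                    # bounding rectangle

def iou(a, b) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_masks_to_boxes(mask_rects, box_rects, iou_th=0.2):
    """Hungarian matching; returns (mask_idx, box_idx) pairs with IoU >= iou_th."""
    cost = np.zeros((len(mask_rects), len(box_rects)))
    for i, m in enumerate(mask_rects):
        for j, b in enumerate(box_rects):
            cost[i, j] = -iou(m, b)                      # negate to maximize IoU
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= iou_th]
```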

Download prepared sequences

| Sequence | Objects in map | Scenes | Submaps* | Video** | Files | Map with path |
|---|---|---|---|---|---|---|
| 00 | 2333 | 9974 | 8, +22% | youtube / rutube | 449.8 MiB | Seq00 map |
| 02 | 1720 | 8304 | 5, +7% | youtube / rutube | 214.4 MiB | Seq02 map |
| 03 | 133 | 453 | - | youtube / rutube | 10.4 MiB | Seq03 map |
| 04 | 1576 | 6892 | 6, +7% | youtube / rutube | 188.3 MiB | Seq04 map |
| 05 | 830 | 4798 | 3, +8% | youtube / rutube | 122.7 MiB | Seq05 map |
| 06 | 1619 | 7552 | 6, +14% | youtube / rutube | 213.3 MiB | Seq06 map |
| 07 | 476 | 1267 | - | youtube / rutube | 32.1 MiB | Seq07 map |
| 09 | 1978 | 10330 | 5, +22% | youtube / rutube | 353.2 MiB | Seq09 map |
| 10 | 1298 | 2174 | 3, +0% | youtube / rutube | 109.5 MiB | Seq10 map |

* - Number of submaps and the resulting increase in the total object count (submaps overlap, so some objects appear in several submaps).
** - On the videos, all visible objects determined at step 2 are marked; objects with too small a mask or with too few lidar points are drawn in grey.

Each sequence has the following data:

  • Semantic map in .csv format. It contains object poses, their classes and global IDs, the frames they were seen in, and mean CLIP features:

| gid | class | x | y | z | frames | mf0 | ... | mf511 |
|---|---|---|---|---|---|---|---|---|
| int | str | float | float | float | [int] as str | float | ... | float |
  • scenes folder with scene data, where each scene is stored as a .yaml file. Each file contains the scene number, the camera pose (in the global frame) given by transform, and the observed objects with their classes, global IDs (matched with the IDs in the map), object poses (in the camera frame) derived from the lidar distances and the angles of the mask center, bounding rects and CLIP features (a loading sketch is given after this list).
frame_no: #int
objects:
  gid1: #int
    cam_pose: #[x, y, z] as floats - pose of object in camera frame
    features: #[x512] as floats
    label: #string
    lidar_dist_mean: #float
    lidar_dist_med: #float
    lidar_dist_min: #float
    lidar_dist_q1: #float
    lidar_dist_wavg: #float
    mask_angles: #[x, y] as floats
    rect: #[x1, y1, x2, y2] as ints
  ...
  gidN: ...
transform: # cam to map as 4x4 matrix
  • .yaml file with export params (see export script below).
  • Map image with objects and track.
  • [new in r0.0.2] submaps folder for sequences with big maps. There, the map is divided into smaller, mutually overlapping areas. Each folder contains a set of files named semantic_submap_#.csv in the same format as the main Semantic map, along with the export parameters and images showing how the submaps are organized. The division script is also provided and documented below.
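
A minimal loading sketch for the exported data is shown below; the file names (semantic_map.csv, scene_000123.yaml) are assumptions made for illustration, so check the downloaded sequence folder for the actual layout:

```python
# Sketch: read the exported semantic map (.csv) and one scene (.yaml).
import numpy as np
import pandas as pd
import yaml

# Semantic map: one row per object with pose, class, global ID, frames and mean CLIP features.
semantic_map = pd.read_csv("semantic_map.csv")             # assumed file name
clip_cols = [f"mf{i}" for i in range(512)]
map_features = semantic_map[clip_cols].to_numpy()          # (N_objects, 512)

# One scene: camera transform plus observed objects keyed by their global IDs.
with open("scenes/scene_000123.yaml") as f:                # assumed file name
    scene = yaml.safe_load(f)

cam_to_map = np.array(scene["transform"], dtype=float).reshape(4, 4)
for gid, obj in scene["objects"].items():
    print(gid, obj["label"], obj["lidar_dist_med"], obj["rect"])
```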

Also please note:

  • In all sequences the following labels are excluded: driveway, fence, ground, pedestrian, railtrack, road, sidewalk, unknownConstruction, unknownGround, unknownObject, vegetation, wall, guardrail.
  • Scenes contain only objects that have a large enough mask area and enough lidar points on the mask (see the params below). For that reason, some objects like lamps occur in scenes very rarely, because they lie outside the lidar's field of view.
  • The map contains only objects that have been detected in at least one scene.
  • Scenes must contain at least one object; frames with zero visible objects are excluded (even from the videos).

Or use the provided script

Installation

  1. Clone this repository:
git clone https://github.com/MoscowskyAnton/scene_recognition_kitti_360
  2. Install the requirements:
pip3 install -r requirements.txt
  3. Download the KITTI-360 data; you will need:
  • Perspective Images (only camera 0 is needed; it can be set in the download script).
  • Semantics of Left Perspective Camera.
  • Raw Velodyne Scans.
  • 3D Bounding Boxes.
  • Calibrations.
  • Vehicle Poses.

Usage

Run the script with tuned params:

python3 scripts/semantic_map_and_scene_extractor.py --kitti_360_path ~/KITTI-360 --sequence 05 ...
  • --kitti_360_path (str, required) path to KITTI-360 folder.
  • --save_path (str, default: None) path to save data; if provided, a subfolder named scene_recogntion/sequence##/save_%d_%m_%H_%M will be created there, otherwise such a subdir is created in --kitti_360_path.
  • --sequence (str, default: '00') chosen data sequence.
  • --min_frame (int, default: -1) Frame to start with; if -1, starts from the very beginning.
  • --max_frame (int, default: -1) Frame to end with; if -1, goes to the very end of the sequence.
  • --to_the_end (bool, default: false) If set, overrides the inner object frame marking and inspects the whole path to the end. The problem is that the inner object frame marking in KITTI-360 may contain errors, so some objects seen again after a while may not be connected with the current frame. This flag fixes that but increases processing time.
  • --min_object_area_px (int, default: 50) Masks with an area (in pixels) smaller than this value are rejected (such an object is added neither to the scene nor to the map).
  • --max_object_dist_m (float, default: 50) Objects further than this value are not added to the scene.
  • --iou_cost_th (float, default: 0.2) IoU threshold for mask-rect matching.
  • --min_lidar_intensity (float, default: 0.2) Lidar points with an intensity lower than this value are discarded.
  • --min_lidar_points (int, default: 10) If an object's mask has fewer lidar points on it than this value, the object is discarded from the scene. Note that only points with intensity higher than --min_lidar_intensity are counted.
  • --do_clip (bool, default: false) If set, calculates CLIP features for objects (see the sketch after this list).
  • --objects_ignore (list of str, default: empty) Additional object labels to ignore.
  • --save_map_unlabeled (bool, default: false) If set, saves an image of the map without object IDs.
  • --save_map_labeled (bool, default: false) If set, saves an image of the map with object IDs. The result looks OK only for small maps, e.g. ones limited with the --min_frame and --max_frame params.
  • --video_cam (bool, default: false) If set, saves a video from camera 0 where object IDs, labels, rects and masks are drawn.
  • --video_map (bool, default: false) If set, saves a top-down video where objects in the scene are associated with the map.
  • --video_mix (bool, default: false) If set, saves a mixed video of camera 0 and the top-down map. Links to examples of such videos are given in the table above.
  • --plot_invisible (bool, default: false) If set, also draws discarded objects on the current scene. Such objects are drawn in grey. Helps to understand why some objects are not in the scene.
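
The CLIP features mentioned in --do_clip are computed on masked crops (step 6 of the conversion process). A minimal sketch of that idea is shown below, assuming the openai CLIP package and a ViT-B/32 model (which yields the 512-dimensional features stored in the files); the actual script may use a different CLIP implementation:

```python
# Sketch: crop an object by its bounding rect, grey out pixels outside the mask,
# and compute CLIP features for the resulting patch.
import numpy as np
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

GREY = 128

def masked_crop(image: np.ndarray, mask: np.ndarray, rect) -> Image.Image:
    """image: HxWx3 uint8 RGB; mask: HxW (0/1); rect: [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = rect
    patch = image[y1:y2, x1:x2].copy()
    patch[~mask[y1:y2, x1:x2].astype(bool)] = GREY        # fill non-object area with grey
    return Image.fromarray(patch)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def object_features(image, mask, rect) -> np.ndarray:
    tensor = preprocess(masked_crop(image, mask, rect)).unsqueeze(0).to(device)
    with torch.no_grad():
        feats = model.encode_image(tensor)                # (1, 512) for ViT-B/32
    return feats.squeeze(0).cpu().numpy()
```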

The script also saves its params as a .yaml file when running.

Script for Map division

Theory

Division process. To obtain overlapping submaps, the following algorithm is proposed (a sketch is given after this list):

  1. Do initial clustering (with AgglomerativeClustering), where n_clusters = ceil(all_objects / submap_max_size). If some cluster still exceeds submap_max_size, it is clustered further.
  2. Find the cluster centers and calculate a 'direction diagram' for each center. A direction diagram is an array of maximum ranges from the center to the cluster elements, one per angular sector of a fixed division angle.
  3. Each sector of the direction diagram is extended beyond the last object of the cluster by a fixed distance increase_range. The objects of other clusters which fall within this extended sector are added to the cluster.
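
A minimal sketch of this division, assuming 2D object coordinates and the same parameter names as the script's flags; the recursive re-clustering of oversized clusters from step 1 is omitted for brevity:

```python
# Sketch of the submap division: agglomerative clustering, then per-cluster
# direction diagrams extended by increase_range to absorb nearby objects.
import numpy as np
from math import ceil
from sklearn.cluster import AgglomerativeClustering

def divide_map(xy: np.ndarray, submap_max_size=300, increase_range=50.0, dir_diag_size=32):
    """xy: (N, 2) object coordinates; returns a list of index arrays, one per submap."""
    n_clusters = ceil(len(xy) / submap_max_size)
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(xy)

    sector = 2 * np.pi / dir_diag_size
    submaps = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        center = xy[members].mean(axis=0)

        # Direction diagram: maximum range from the center for each angular sector.
        d = xy[members] - center
        ang = (np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi) // sector).astype(int)
        dist = np.linalg.norm(d, axis=1)
        diagram = np.zeros(dir_diag_size)
        for a, r in zip(ang, dist):
            diagram[a] = max(diagram[a], r)

        # Extend every sector by increase_range and take all objects falling inside.
        d_all = xy - center
        ang_all = (np.arctan2(d_all[:, 1], d_all[:, 0]) % (2 * np.pi) // sector).astype(int)
        dist_all = np.linalg.norm(d_all, axis=1)
        submaps.append(np.flatnonzero(dist_all <= diagram[ang_all] + increase_range))
    return submaps
```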

Usage

Run the script with tuned params:

python3 semantic_map_divider.py --path ~/Dataset/KITTI-360/KITTI-360/scene_recogntion/sequence09/save_03_09_12_11/ ...
  • --path (str, must be provided) Path to folder with sequence data, obtained with semantic_map_and_scene_extractor.py script.
  • --submap_max_size (int, default: 300) Max size of submaps in objects.
  • --increase_range (float, default: 50) Range used to extend clusters.
  • --dir_diag_size (int, default: 32) Size of the direction diagrams (number of angular sectors).
  • --objects_ignore (list of str, default: empty) Excludes objects from map.

The script also saves its params as a .yaml file when running.