EvIs-Kitchen Dataset

Shinoda Laboratory, Tokyo Institute of Technology

Denso IT Recognition Learning Algorithm Laboratory

[ Information ]:

The Egocentric video and Inertial sensor data Kitchen activity (EvIs-Kitchen) dataset is the first V-S-S interaction-focused dataset for the egocentric human activity recognition (ego-HAR) task.

It consists of sequences of everyday kitchen activities, which involve rich interactions among the subject's body, objects, and the environment.

Besides the egocentric videos recorded by a GoPro camera, the dataset also includes inertial sensor data recorded by Fitbit watches worn on the subject's wrists, synchronized with the video stream.

In total, the dataset contains 4,527 action samples from 12 subjects and 7 recipes, labeled with 35 verb classes and 56 noun classes.

[ Content ]:

Our dataset contains 4 major folders: /Annotation, /Video, /RGB-frames, and /Sensor.

/Annotation:

The annotations of all action segments are collected in one CSV file. Each line in this file annotates one sample (see the sketch after this list):

  • narration_id ("S01R01_011"): "S01" means this action is from subject 1, "R01" means it is from recipe 1, and the trailing "011" is the index of this action within the entire cooking process.
  • verb ("crack"): The Verb label of this action segment.
  • noun ("egg"): The Noun label of this action segment.
  • start_frame (4215): The index of the frame (in both the RGB-frames sequence and the Sensor sequence) at which this action starts.
  • stop_frame (4394): The index of the frame (in both the RGB-frames sequence and the Sensor sequence) at which this action ends.
  • start_time (02:20.5): The time in the Video at which this action starts.
  • stop_time (02:26.4): The time in the Video at which this action ends.
  • temporal_length (5976): The duration of this action, in milliseconds.
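
A minimal pandas sketch of reading these fields (the CSV file name "annotations.csv" is an assumption; adjust it to the actual file in /Annotation). Note that start_frame / 30 recovers start_time, since frames and sensor data are sampled at 30 fps:

    import pandas as pd

    # Hypothetical file name; use the actual CSV shipped in /Annotation.
    ann = pd.read_csv("Annotation/annotations.csv")

    # Pick one sample and recover its subject/recipe from narration_id.
    row = ann.iloc[0]
    subject = row["narration_id"][:3]    # e.g. "S01"
    recipe = row["narration_id"][3:6]    # e.g. "R01"

    # start_frame indexes the 30-fps RGB/sensor sequences, so dividing
    # by 30 gives the time in seconds within the long video:
    start_sec = row["start_frame"] / 30.0   # 4215 / 30 = 140.5 s = 02:20.5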

/Video:

The original raw videos recorded by the GoPro camera, at 1920x1080 resolution and 60 fps. Each MP4 file is a complete recording of one subject cooking one recipe and contains many action segments.
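
A quick sanity check of a raw recording's properties with OpenCV (the file name "S01R01.MP4" is an assumption about the naming scheme):

    import cv2

    cap = cv2.VideoCapture("Video/S01R01.MP4")
    fps = cap.get(cv2.CAP_PROP_FPS)               # expected: 60.0
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))    # expected: 1920
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))   # expected: 1080
    cap.release()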

/RGB-frames:

The 30-fps frame sequences extracted from the long videos in the /Video directory. Each folder contains the frame sequence of the corresponding long cooking video.

All frames are resized to 228x128 to reduce redundancy and save GPU memory during training.
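
A minimal sketch of loading the frames of one action segment; the folder name and the file-name pattern below are assumptions, so adjust them to the actual release:

    from pathlib import Path
    from PIL import Image

    # Hypothetical layout: one folder per long video, frames named by index.
    frames_dir = Path("RGB-frames/S01R01")
    frames = [
        Image.open(frames_dir / f"frame_{i:010d}.jpg")
        for i in range(4215, 4395)  # start_frame..stop_frame of "S01R01_011"
    ]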

/Sensor:

The 30-fps inertial sensor data recorded by the Fitbit watches, in npy format. Each npy file contains the complete sensor data sequence of the corresponding long cooking video.

In each sensor sequence, every frame has shape (2, 10). The first dimension indexes the two hands, ordered [left, right]. The second dimension holds the 10 inertial sensor channels: a 3-axis accelerometer, a 3-axis gyroscope, and a 4-component orientation, ordered [acc-x, acc-y, acc-z, gyro-x, gyro-y, gyro-z, ori-a, ori-b, ori-c, ori-d].
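
A minimal sketch of slicing one action segment out of a sensor sequence under the layout above (the npy file name is an assumption):

    import numpy as np

    # Hypothetical file name; each npy holds the full (T, 2, 10) sequence
    # for one long cooking video, frame-aligned with the 30-fps RGB frames.
    seq = np.load("Sensor/S01R01.npy")

    clip = seq[4215:4395]                 # slice one action by frame index
    left, right = clip[:, 0], clip[:, 1]  # hands, ordered [left, right]
    acc = right[:, 0:3]                   # 3-axis accelerometer
    gyro = right[:, 3:6]                  # 3-axis gyroscope
    ori = right[:, 6:10]                  # 4-component orientation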

[ Publication ]:

Please cite our papers if you use this dataset in your research.

@inproceedings{hao2023evis,
    title={EvIs-Kitchen: Egocentric Human Activities Recognition with Video and Inertial Sensor Data},
    author={Hao, Yuzhe and Uto, Kuniaki and Kanezaki, Asako and Sato, Ikuro and Kawakami, Rei and Shinoda, Koichi},
    booktitle={International Conference on Multimedia Modeling},
    pages={373--384},
    year={2023},
    organization={Springer}
}
@article{hao2024egocentric,
    title={Egocentric Human Activities Recognition with Multi-modal Interaction Sensing},
    author={Hao, Yuzhe and Kanezaki, Asako and Sato, Ikuro and Kawakami, Rei and Shinoda, Koichi},
    journal={IEEE Sensors Journal},
    year={2024},
    publisher={IEEE}
}

[ Access ]:

Please send an application email to yuzhe[at]ks.c.titech.ac.jp, including your name, institute, and purpose, to obtain access to the dataset.