Language Supervised Human Action Recognition with Salient Fusion: Construction Worker Action Recognition as a Use-Case

Mohammad Mahdavian¹ Mohammad Loni² Mo Chen¹

1 School of Computing Science, Simon Fraser University, Burnaby, BC, Canada

2 Future Solutions Department, Volvo Construction Equipment, Eskilstuna, Sweden


Illustration of a human instructing an industrial Volvo autonomous machine.

Abstract

Detecting human actions is a crucial task for autonomous robots and vehicles, and it often requires integrating several data modalities to improve accuracy. In this study, we introduce a novel approach to Human Action Recognition (HAR) based on skeleton and visual cues. Our method leverages a language model to guide feature extraction in the skeleton encoder. Specifically, we employ learnable prompts for the language model, conditioned on the skeleton modality, to optimize the feature representation. Furthermore, we propose a salient fusion module that combines the features of the two modalities, using attention and transformer mechanisms to address the modalities' high dimensionality. This fusion process prioritizes informative video frames and body joints, enhancing the recognition accuracy of human actions. Additionally, we introduce a new dataset, named VolvoConstAct, tailored for real-world robotic applications on construction sites and featuring visual, skeleton, and depth data modalities. This dataset facilitates the training and evaluation of autonomous construction machines performing specific tasks. To evaluate our approach, we conduct experiments on our dataset as well as three widely used public datasets: NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA. Results reveal that our proposed method achieves promising performance across all datasets, demonstrating its robustness and potential for various applications.
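The idea of prioritizing informative frames and joints before fusing two modalities can be illustrated with a small attention-pooling sketch. This is not the authors' salient fusion module: the function names (softmax, salient_pool, fuse), the dot-product scoring against a single query vector, and the fusion by concatenation are all assumptions made for illustration, and the transformer and language-supervision components are left out entirely.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def salient_pool(features, query):
    # Hypothetical helper: score each frame (or joint) feature by its
    # dot product with a query vector, then return the attention-weighted
    # sum of the features together with the weights.
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights

def fuse(skeleton_feats, visual_feats, query):
    # Assumed fusion rule: pool each modality separately, then concatenate.
    skel, _ = salient_pool(skeleton_feats, query)
    vis, _ = salient_pool(visual_feats, query)
    return skel + vis

# Toy inputs: three frames (or joints) per modality, two feature dimensions each.
skeleton = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
visual = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]
query = [1.0, 0.0]

pooled, weights = salient_pool(skeleton, query)
fused = fuse(skeleton, visual, query)
```

In this toy setup the third skeleton entry has the highest score against the query, so it receives the largest weight and dominates the pooled feature. In a trained model the query and the per-modality features would be learned rather than fixed by hand.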

VolvoConstAct Video Samples

All OK?

Arm Circle

Back Up

Come

Emergency

Excavator Swing Left

Excavator Swing Right

Follow Me

Turn Left

Turn Right

Do You See Me?

Slow Down

Stop

Load Lift Down

Load Lift Up

Wheel Loader Tilt Down

Wheel Loader Tilt Up