TY - GEN
T1 - On the Fusion of RGB and Depth Information for Hand Pose Estimation
AU - Kazakos, Evangelos
AU - Nikou, Christophoros
AU - Kakadiaris, Ioannis A.
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/8/29
Y1 - 2018/8/29
N2 - Recent advances in deep learning have spurred 3D hand pose estimation, as convolutional network (ConvNet) based methods outperformed random forests. However, in the state of the art, ConvNet based methods employ only depth images of the hand without leveraging color and texture information from the RGB domain. In this paper, we investigate whether ConvNets can learn more rich and discriminative embeddings, by combining RGB and depth information. To answer this question, we propose the fusion of RGB and depth information in a double-stream architecture. More specifically, RGB and depth images are fed into two separate networks by extracting features, which are subsequently fused at an intermediate layer of the ConvNet, implementing input-level fusion, feature-level fusion and score-level fusion. The double-stream scheme is coupled with a deep ConvNet, contrary to the shallow networks that are mostly proposed in the literature. Experimental results show that while the depth of the network is crucial for hand pose estimation, the double-stream nets perform very similarly with the net trained only with depth images. This may suggest that training double-stream architectures purely with supervision may be insufficient for hand pose estimation with RGB-D fusion.
AB - Recent advances in deep learning have spurred 3D hand pose estimation, as convolutional network (ConvNet) based methods outperformed random forests. However, in the state of the art, ConvNet based methods employ only depth images of the hand without leveraging color and texture information from the RGB domain. In this paper, we investigate whether ConvNets can learn more rich and discriminative embeddings, by combining RGB and depth information. To answer this question, we propose the fusion of RGB and depth information in a double-stream architecture. More specifically, RGB and depth images are fed into two separate networks by extracting features, which are subsequently fused at an intermediate layer of the ConvNet, implementing input-level fusion, feature-level fusion and score-level fusion. The double-stream scheme is coupled with a deep ConvNet, contrary to the shallow networks that are mostly proposed in the literature. Experimental results show that while the depth of the network is crucial for hand pose estimation, the double-stream nets perform very similarly with the net trained only with depth images. This may suggest that training double-stream architectures purely with supervision may be insufficient for hand pose estimation with RGB-D fusion.
KW - Deep learning
KW - Double-stream networks
KW - Fusion
KW - Hand pose estimation
KW - RGB-D
UR - http://www.scopus.com/inward/record.url?scp=85062903677&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062903677&partnerID=8YFLogxK
U2 - 10.1109/ICIP.2018.8451022
DO - 10.1109/ICIP.2018.8451022
M3 - Conference contribution
AN - SCOPUS:85062903677
T3 - Proceedings - International Conference on Image Processing, ICIP
SP - 868
EP - 872
BT - 2018 IEEE International Conference on Image Processing, ICIP 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th IEEE International Conference on Image Processing, ICIP 2018
Y2 - 7 October 2018 through 10 October 2018
ER -