The proliferation of videos synthesized by generative models, especially human-centric ones that simulate realistic human actions, poses significant threats to information security and authenticity. While progress has been made in binary forgery video detection, the lack of fine-grained understanding of forgery types raises concerns about both reliability and interpretability, which are critical for real-world applications. To address this limitation, we propose HumanSAM, a new framework grounded in the fundamental failure modes of video generation models. Specifically, HumanSAM classifies human-centric forgeries into three distinct types of artifacts commonly observed in generated content: spatial, appearance, and motion anomalies. To better capture geometric, semantic, and spatiotemporal-consistency features, we construct the human forgery representation by fusing two branches: one for video understanding and one for spatial depth. We also adopt a rank-based confidence enhancement strategy during training, which introduces three prior scores to learn more robust representations. For training and evaluation, we construct the first public benchmark, the Human-centric Forgery Video (HFV) dataset, in which all forgery types are carefully annotated semi-automatically. In our experiments, HumanSAM yields promising results compared with state-of-the-art methods, in both binary and multi-class forgery classification.
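To make the confidence strategy concrete, below is a minimal PyTorch sketch, assuming the three prior scores arrive as a (B, 3) tensor per batch. The score definitions and the exact rank-to-weight mapping used by HumanSAM are not specified here, so the mapping below (average within-batch rank, normalized to (0, 1]) and the function names are hypothetical choices for illustration only.

import torch
import torch.nn.functional as F

def rank_confidence_weights(prior_scores: torch.Tensor) -> torch.Tensor:
    # prior_scores: (B, 3), one column per prior score (definitions assumed).
    # argsort-of-argsort yields each sample's rank under each score.
    ranks = prior_scores.argsort(dim=0).argsort(dim=0).float()  # (B, 3)
    avg_rank = ranks.mean(dim=1)                                # (B,)
    # Map the average rank to a confidence weight in (0, 1].
    return (avg_rank + 1.0) / prior_scores.shape[0]

def confidence_weighted_loss(logits, labels, prior_scores):
    # Down-weight low-confidence samples so training emphasizes
    # reliably-scored examples, encouraging a more robust representation.
    weights = rank_confidence_weights(prior_scores)
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return (weights * per_sample).mean()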
Our proposed HumanSAM framework consists of three main components: (1) a dual-branch architecture that combines video understanding and spatial-depth analysis to capture geometric, semantic, and spatiotemporal-consistency features; (2) a rank-based confidence enhancement strategy that leverages three prior scores to learn more robust representations during training; and (3) a multi-class classification head that categorizes human-centric forgeries into spatial, appearance, and motion anomaly types. The framework processes input videos through both RGB and depth modalities, extracting complementary features that are fused into a comprehensive human forgery representation for accurate classification.
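The following is a minimal sketch of this dual-branch pipeline, assuming a four-way output (real plus the three anomaly types). The toy Conv3d encoders are hypothetical stand-ins for the actual video-understanding and spatial-depth backbones, which are not detailed here.

import torch
import torch.nn as nn

class DualBranchForgeryClassifier(nn.Module):
    def __init__(self, feat_dim=512, num_classes=4):
        super().__init__()
        # Placeholder encoder over RGB clips (B, 3, T, H, W) -> (B, feat_dim).
        self.video_branch = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        # Placeholder encoder over depth clips (B, 1, T, H, W) -> (B, feat_dim).
        self.depth_branch = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        # Fuse both branches into a single human forgery representation.
        self.fusion = nn.Linear(2 * feat_dim, feat_dim)
        # Head over: real, spatial anomaly, appearance anomaly, motion anomaly.
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb, depth):
        fused = torch.cat([self.video_branch(rgb),
                           self.depth_branch(depth)], dim=-1)
        return self.head(torch.relu(self.fusion(fused)))

In use, logits = model(rgb, depth) could feed the confidence-weighted loss sketched after the abstract above.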
Multi-class classification results, showing HumanSAM's performance in distinguishing among the spatial, appearance, and motion anomaly categories of human-centric forgery.
Binary classification results, demonstrating HumanSAM's effectiveness in general forgery detection by separating real from fake human-centric videos across different datasets and evaluation metrics.
@article{liu2025humansam,
  title={HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly},
  author={Liu, Chang and Ye, Yunfan and Zhang, Fan and Zhou, Qingyang and Luo, Yuchuan and Cai, Zhiping},
  journal={arXiv preprint arXiv:2507.19924},
  year={2025}
}