The proliferation of videos synthesized by generative models, especially human-centric ones that simulate realistic human actions, poses significant threats to information security and authenticity. While progress has been made in binary forgery video detection, the lack of fine-grained understanding of forgery types raises concerns about both reliability and interpretability, which are critical for real-world applications. To address this limitation, we propose HumanSAM, a new framework grounded in the fundamental failure modes of video generation models. Specifically, HumanSAM classifies human-centric forgeries into three distinct types of artifacts commonly observed in generated content: spatial, appearance, and motion anomalies. To better capture geometric, semantic, and spatiotemporal-consistency features, we construct the human forgery representation by fusing two branches: one for video understanding and one for spatial depth. We also adopt a rank-based confidence enhancement strategy during training, which introduces three prior scores to learn more robust representations. For training and evaluation, we construct the first public benchmark, the Human-centric Forgery Video (HFV) dataset, in which all forgery types are carefully annotated semi-automatically. In our experiments, HumanSAM yields promising results compared with state-of-the-art methods, in both binary and multi-class forgery classification.
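To make the confidence strategy concrete, below is a minimal PyTorch sketch, assuming the three prior scores arrive as a (B, 3) tensor per batch. The score definitions and the exact rank-to-weight mapping used by HumanSAM are not specified here, so the mapping below (average within-batch rank, normalized to (0, 1]) and the function names are hypothetical choices for illustration only.

import torch
import torch.nn.functional as F

def rank_confidence_weights(prior_scores: torch.Tensor) -> torch.Tensor:
    # prior_scores: (B, 3), one column per prior score (definitions assumed).
    # argsort-of-argsort yields each sample's rank under each score.
    ranks = prior_scores.argsort(dim=0).argsort(dim=0).float()  # (B, 3)
    avg_rank = ranks.mean(dim=1)                                # (B,)
    # Map the average rank to a confidence weight in (0, 1].
    return (avg_rank + 1.0) / prior_scores.shape[0]

def confidence_weighted_loss(logits, labels, prior_scores):
    # Down-weight low-confidence samples so training emphasizes
    # reliably-scored examples, encouraging a more robust representation.
    weights = rank_confidence_weights(prior_scores)
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return (weights * per_sample).mean()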
Our proposed HumanSAM framework consists of three main components: (1) a dual-branch architecture that combines video understanding and spatial-depth analysis to capture geometric, semantic, and spatiotemporal-consistency features; (2) a rank-based confidence enhancement strategy that leverages three prior scores to learn more robust representations during training; and (3) a multi-class classification head that categorizes human-centric forgeries into spatial, appearance, and motion anomaly types. The framework processes input videos through both RGB and depth modalities, extracting complementary features that are fused into a comprehensive human forgery representation for accurate classification.
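The following is a minimal sketch of this dual-branch pipeline, assuming a four-way output (real plus the three anomaly types). The toy Conv3d encoders are hypothetical stand-ins for the actual video-understanding and spatial-depth backbones, which are not detailed here.

import torch
import torch.nn as nn

class DualBranchForgeryClassifier(nn.Module):
    def __init__(self, feat_dim=512, num_classes=4):
        super().__init__()
        # Placeholder encoder over RGB clips (B, 3, T, H, W) -> (B, feat_dim).
        self.video_branch = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        # Placeholder encoder over depth clips (B, 1, T, H, W) -> (B, feat_dim).
        self.depth_branch = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        # Fuse both branches into a single human forgery representation.
        self.fusion = nn.Linear(2 * feat_dim, feat_dim)
        # Head over: real, spatial anomaly, appearance anomaly, motion anomaly.
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb, depth):
        fused = torch.cat([self.video_branch(rgb),
                           self.depth_branch(depth)], dim=-1)
        return self.head(torch.relu(self.fusion(fused)))

In use, logits = model(rgb, depth) could feed the confidence-weighted loss sketched after the abstract above.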
Multi-class classification results, showing HumanSAM's performance in distinguishing among the spatial, appearance, and motion anomaly categories of human-centric forgery.
Binary classification results, demonstrating HumanSAM's effectiveness in general forgery detection by separating real from fake human-centric videos across different datasets and evaluation metrics.
@article{liu2025humansam,
  title={HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly},
  author={Liu, Chang and Ye, Yunfan and Zhang, Fan and Zhou, Qingyang and Luo, Yuchuan and Cai, Zhiping},
  journal={arXiv preprint arXiv:2507.19924},
  year={2025}
}