Journal of Donghua University (Natural Science)

2025, 51(06): 62-69

Deepfake face-tampering video detection method fusing CNN and ViT

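CHEN Ao (陈傲), BAI Enjian (白恩健), WU Yun (吴贇), CAO Yuwen (曹誉文), JIANG Xueqin (蒋学芹)

1. College of Information Science and Technology, Donghua University

Foundation: National Natural Science Foundation of China (62301143); Natural Science Foundation of Shanghai (24ZR1401000)

Email: baiej@dhu.edu.cn

DOI: 10.19886/j.cnki.dhdz.2024.0393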


Abstract:

Deepfake video detection is an active research problem in computer vision. Existing detectors based on convolutional neural networks (CNN) or Vision Transformers (ViT) commonly suffer from long training and testing times and a marked drop in accuracy in cross-dataset scenarios. To address these issues, this paper proposes a detection method that fuses CNN and ViT. A CNN encoder module is designed from detail-enhanced convolution (DEConv) and a spatial group coordinate attention module, and the two are combined to form a feature extraction branch. This branch is then connected to an improved ViT module, so that the model combines local feature extraction with global modeling. Finally, a mask strategy for non-key facial regions, the Key-Detect Mask (KDM), is proposed so that the model concentrates on key facial regions, suffers less interference from secondary features, and is more robust under multiple perturbations. Experimental results show that the method achieves an average video-level area under the ROC curve (AUC) of 99.13% on three mainstream datasets and an average video-level AUC of 86.54% in cross-dataset generalization experiments, outperforming the compared methods.

Keywords: deepfake detection; Vision Transformer; facial landmarks; attention mechanism
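
The abstract reports results as video-level AUC. As a minimal sketch of how such a metric could be computed, the Python below averages frame-level fake probabilities per video and then evaluates the AUC as a pairwise ranking statistic. The mean aggregation and all function names (`video_level_scores`, `roc_auc`, `video_level_auc`) are assumptions for illustration; the abstract does not say how the authors aggregate frame scores.

```python
from collections import defaultdict


def video_level_scores(frame_scores, frame_video_ids):
    """Average frame-level fake probabilities per video.

    Mean aggregation is an assumption; the abstract does not specify it.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for score, vid in zip(frame_scores, frame_video_ids):
        sums[vid] += score
        counts[vid] += 1
    return {vid: sums[vid] / counts[vid] for vid in sums}


def roc_auc(scores, labels):
    """AUC as the fraction of (fake, real) pairs ranked correctly.

    Label 1 means fake, label 0 means real. Ties count as one half,
    which makes this equal to the area under the ROC curve.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one real and one fake video")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def video_level_auc(frame_scores, frame_video_ids, video_labels):
    """Aggregate frames to videos, then score the videos with AUC."""
    per_video = video_level_scores(frame_scores, frame_video_ids)
    vids = sorted(per_video)
    return roc_auc([per_video[v] for v in vids],
                   [video_labels[v] for v in vids])


# Toy data (hypothetical): two fake videos (a, c) and two real ones (b, d).
frame_scores = [0.9, 0.7, 0.2, 0.4, 0.1, 0.3, 0.05, 0.15]
frame_video_ids = ["a", "a", "b", "b", "c", "c", "d", "d"]
video_labels = {"a": 1, "b": 0, "c": 1, "d": 0}
print(video_level_auc(frame_scores, frame_video_ids, video_labels))  # 0.75
```

In the toy data, fake video c scores below real video b, so three of the four fake/real pairs are ranked correctly and the AUC is 0.75. The pairwise loop is quadratic in the number of videos; a rank-based implementation would be preferable for large test sets.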

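Basic information:

DOI: 10.19886/j.cnki.dhdz.2024.0393

CLC number: TP391.41; TP183

Citation:

[1] CHEN Ao, BAI Enjian, WU Yun, et al. Deepfake face-tampering video detection method fusing CNN and ViT[J]. Journal of Donghua University (Natural Science), 2025, 51(06): 62-69. DOI: 10.19886/j.cnki.dhdz.2024.0393.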
