Abstract: The massive and growing burden that depression imposes on modern society has motivated investigations into early detection through automated, scalable, and non-invasive methods, including those based on speech. However, speech-based methods that effectively capture articulatory information across different recording devices and in naturalistic environments are still needed. This article presents a novel multi-level attention-based network for multi-modal depression prediction that fuses features from the audio, video, and text modalities while learning intra- and inter-modality relevance. Multi-level attention reinforces overall learning by selecting the most influential features within each modality for decision-making. We perform exhaustive experiments to build separate regression models for the audio, video, and text modalities. Evaluations of both landmark-duration features and landmark n-gram features on the DAIC-WOZ and SH2 datasets show that they are highly effective, both alone and fused, relative to existing approaches.
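The two-level fusion described in the abstract can be illustrated with a minimal NumPy sketch: intra-modality attention first pools the feature rows within each modality into a summary vector, and inter-modality attention then weights the three summaries before a regression head. The feature values and the norm-based scoring used here are illustrative assumptions only, not the paper's learned parameters.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(vecs, scores):
    """Weighted sum of the rows of `vecs` using softmax-normalised `scores`."""
    w = softmax(scores)
    return w @ vecs

# Hypothetical per-modality features (audio, video, text): 3 rows x 4 dims each.
rng = np.random.default_rng(0)
modalities = {m: rng.normal(size=(3, 4)) for m in ("audio", "video", "text")}

# Level 1: intra-modality attention selects the most influential feature
# rows within each modality (row norms stand in for learned query scores).
summaries = np.stack([
    attend(f, np.linalg.norm(f, axis=1)) for f in modalities.values()
])

# Level 2: inter-modality attention weights the three modality summaries
# into a single fused vector fed to the final depression-score regressor.
fused = attend(summaries, np.linalg.norm(summaries, axis=1))
print(fused.shape)  # (4,)
```

In a trained model, the stand-in norm scores would be replaced by learned projections, but the fusion pattern is the same at both levels.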

Keywords: Depression classification, landmark n-grams, speech articulation, smartphone speech, naturalistic environments

DOI: 10.17148/IJARCCE.2023.125123
