IEEE Journal of Selected Topics in Signal Processing

AV-CrossNet: An Audiovisual Complex Spectral Mapping Network for Speech Separation by Leveraging Narrow- and Cross-Band Modeling
Vahid Ahmadi Kalkhorani, Cheng Yu, Anurag Kumar, Ke Tan, Buye Xu, DeLiang Wang
Keywords: Visualization; Speech enhancement; Faces; Feature extraction; Computational modeling; Training; Time-domain analysis; Background noise; Convolution; Time-frequency analysis; Speech Separation; Convolutional Layers; Visual Cues; Visual Features; Separation Performance; Early Fusion; Positional Encoding; Visual Encoding; Training Set; Training Dataset; Deep Neural Network; Background Noise; Utterances; Imaginary Part; Large Improvement; Output Channels; Noisy Environments; Auditory Cues; Speech Signal; Linear Layer; Time-domain Model; Short-time Fourier Transform; Audiovisual Speech; dB Range; Visual Stream; Hourly Data; Group Convolution; Multi-head Self-attention; Audio Stream; Feature Dimension; Audiovisual speech enhancement; audiovisual target speaker extraction; audiovisual speaker separation; TF-CrossNet; AV-CrossNet
Abstract: Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the TF-CrossNet architecture, a recently proposed network that performs complex spectral mapping for speech separation by leveraging global attention and positional encoding. To effectively utilize visual cues, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before being fed to AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, TCD-TIMIT, and the COG-MHEAR challenge, in terms of PESQ, STOI, SNR, and SDR. Evaluation results demonstrate that AV-CrossNet advances state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets.
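To make the early-fusion idea above concrete, here is a minimal PyTorch sketch of how audio spectral features and pre-extracted visual embeddings could be aligned in time and fused before the separation blocks. It is illustrative only, not the authors' implementation; the module name EarlyFusion and all dimensions are hypothetical.

# Minimal sketch of audio-visual early fusion (hypothetical dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyFusion(nn.Module):
    def __init__(self, audio_dim=514, visual_dim=512, fused_dim=256):
        super().__init__()
        # Temporal convolution over pre-extracted visual embeddings.
        self.visual_encoder = nn.Conv1d(visual_dim, fused_dim, kernel_size=3, padding=1)
        self.audio_proj = nn.Linear(audio_dim, fused_dim)
        self.fuse = nn.Linear(2 * fused_dim, fused_dim)

    def forward(self, audio_feats, visual_embeds):
        # audio_feats: (batch, T_audio, audio_dim) stacked real/imag STFT features
        # visual_embeds: (batch, T_video, visual_dim) pre-extracted lip embeddings
        a = self.audio_proj(audio_feats)                         # (B, T_audio, F)
        v = self.visual_encoder(visual_embeds.transpose(1, 2))   # (B, F, T_video)
        # Upsample the visual stream to the audio frame rate before fusing.
        v = F.interpolate(v, size=a.shape[1], mode="linear", align_corners=False)
        v = v.transpose(1, 2)                                    # (B, T_audio, F)
        return self.fuse(torch.cat([a, v], dim=-1))              # fused features for the separation blocks

fused = EarlyFusion()(torch.randn(2, 200, 514), torch.randn(2, 50, 512))
print(fused.shape)  # torch.Size([2, 200, 256])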
HPCNet: Hybrid Pixel and Contour Network for Audio-Visual Speech Enhancement With Low-Quality Video
Hang Chen, Chen-Yue Zhang, Qing Wang, Jun Du, Sabato Marco Siniscalchi, Shi-Fu Xiong, Gen-Shun Wan
Keywords: Lips; Visualization; Benchmark testing; Video recording; Speech enhancement; Degradation; Cameras; Training; Quality assessment; Noise measurement; Speech Enhancement; Hybrid Pixel; Convolutional Layers; Performance Degradation; Simulation Method; Teacher Model; Training Videos; Video Quality; Graph Convolution; Speech Quality; Reconstruction Module; Contour Features; Training Set; Low Resolution; Deep Neural Network; Visual Features; Rate Set; Semantic Similarity; Video Frames; Global Average Pooling; High-quality Video; Speech Intelligibility; Clear Speech; Triplet Loss; Missing Rate; Graph Convolutional Network; Pixel Count; Adjacency Matrix Of Graph; Attention Heads; Conduct Ablation Experiments; Speech enhancement; audio-visual; graph convolutional network; talking face generation; knowledge distillation
Abstract: To advance audio-visual speech enhancement (AVSE) research in low-quality video settings, we introduce the multimodal information-based speech processing-low quality video (MISP-LQV) benchmark, which includes a 120-hour real-world Mandarin audio-visual dataset, two video degradation simulation methods, and benchmark results from several well-known AVSE models. We also propose a novel hybrid pixel and contour network (HPCNet), incorporating a lip reconstruction and distillation (LRD) module and a contour graph convolution (CGConv) layer. Specifically, the LRD module reconstructs high-quality lip frames from low-quality audio-visual data, utilizing knowledge distillation from a teacher model trained on high-quality data. The CGConv layer employs spatio-temporal and semantic-contextual graphs to capture complex relationships among lip landmark points. Extensive experiments on the MISP-LQV benchmark reveal the performance degradation caused by low-quality video across various AVSE models. Notably, including real or simulated low-quality videos in AVSE training enhances robustness to low-quality videos but degrades performance on high-quality videos. The proposed HPCNet demonstrates strong robustness against video quality degradation, which can be attributed to (1) the reconstructed lip frames closely aligning with high-quality frames and (2) the contour features exhibiting consistency across different video quality levels. The generalizability of HPCNet has also been validated through experiments on the 2nd COG-MHEAR AVSE Challenge dataset.
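As a rough illustration of graph convolution over lip landmark points, the sketch below applies a learnable, row-normalized adjacency to per-landmark features. It is not the CGConv layer itself (which also builds spatio-temporal and semantic-contextual graphs); the class name LandmarkGraphConv, the number of points, and the dimensions are assumptions.

# Minimal sketch of a graph convolution over lip landmark points
# (hypothetical; not the HPCNet CGConv layer itself).
import torch
import torch.nn as nn

class LandmarkGraphConv(nn.Module):
    def __init__(self, in_dim=2, out_dim=64, num_points=20):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim)
        # Learnable adjacency among lip landmark points (spatial graph).
        self.adj = nn.Parameter(torch.eye(num_points) + 0.01 * torch.randn(num_points, num_points))

    def forward(self, landmarks):
        # landmarks: (batch, num_points, in_dim), e.g. (x, y) coordinates per point
        a = torch.softmax(self.adj, dim=-1)   # row-normalized adjacency
        h = self.weight(landmarks)            # node-wise linear transform
        return torch.relu(a @ h)              # aggregate neighbor features

out = LandmarkGraphConv()(torch.randn(4, 20, 2))
print(out.shape)  # torch.Size([4, 20, 64])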
Input-Independent Subject-Adaptive Channel Selection for Brain-Assisted Speech Enhancement
Qingtian Xu, Jie Zhang, Zhenhua Ling
Keywords: Electroencephalography; Brain modeling; Training; Speech enhancement; Noise; Electrodes; Switches; Noise measurement; Voice activity detection; Vectors; Channel Selection; Speech Enhancement; Electrode; Selection Method; Channel Information; Brain Signals; Adversarial Training; Auditory Attention; Hearing-impaired Listeners; Memory Phenomenon; Training Set; Deep Neural Network; Neural Activity; Sparsity; Independent Component Analysis; Version Of Test; Part Of Fig; Linear Layer; Training Subsets; Audio Data; Speech Envelope; Subset Of Channels; Target Speech; Selection Vector; Average Fitness; Depthwise Separable Convolution; Selective Layer; Total Loss Function; Audio Stimuli; Speech Quality; Brain-assisted speech enhancement; electroencephalogram; channel selection; subject adaption; over memory
Abstract: Brain-assisted speech enhancement (BASE), which utilizes electroencephalogram (EEG) signals as an assistive modality, has shown great potential for extracting the target speaker in multi-talker conditions. This is feasible because the EEG measurements contain the auditory attention of hearing-impaired listeners, which can be leveraged to classify the target identity. Considering that an EEG cap with sparse channels exhibits multiple benefits and that in practice many electrodes contribute only marginally, EEG channel selection for BASE is desirable. This problem has been tackled in a subject-invariant manner in the literature, but the resulting BASE performance varies significantly across subjects. In this work, we therefore propose an input-independent subject-adaptive channel selection method for BASE, called subject-adaptive convolutional regularization selection (SA-ConvRS), which enables a personalized informative channel distribution. We also observe an abnormal "over memory" phenomenon that allows the model to perform BASE without any brain signals, which often occurs in related fields due to data recording and validation conditions. To remove this effect, we further design a task-based multi-process adversarial training (TMAT) approach by exploiting pseudo-EEG inputs. Experimental results on a public dataset show that the proposed SA-ConvRS can achieve subject-adaptive channel selections and keep the BASE performance close to the full-channel upper bound; TMAT can avoid the over memory problem without sacrificing the performance of SA-ConvRS.
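The general idea of learnable channel selection can be sketched as a per-channel gate regularized toward sparsity, so that only an informative subset of EEG channels survives training. The snippet below is a simplified illustration, not SA-ConvRS or TMAT; the class name ChannelSelector and all sizes are hypothetical.

# Minimal sketch of learnable EEG channel selection via a regularized
# selection vector (hypothetical; not the SA-ConvRS layer itself).
import torch
import torch.nn as nn

class ChannelSelector(nn.Module):
    def __init__(self, num_channels=64):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_channels))

    def forward(self, eeg):
        # eeg: (batch, num_channels, T)
        gate = torch.sigmoid(self.logits)            # soft per-channel weights in [0, 1]
        return eeg * gate.view(1, -1, 1), gate

selector = ChannelSelector()
gated, gate = selector(torch.randn(8, 64, 1000))
sparsity_loss = gate.abs().mean()                    # encourages few active channels
print(gated.shape, float(sparsity_loss))
# After training, keep e.g. the top-10 channels: torch.topk(gate, k=10).indices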
$C^{2}$AV-TSE: Context and Confidence-Aware Audio Visual Target Speaker Extraction
Wenxuan Wu, Xueyuan Chen, Shuai Wang, Jiadong Wang, Lingwei Meng, Xixin Wu, Helen Meng, Haizhou Li
Keywords: Visualization; Data mining; Context modeling; Training; Predictive models; Lips; Feature extraction; Semantics; Signal processing algorithms; Hidden Markov models; Target Speaker; Contextual Information; Visual Cues; Confidence Score; Acoustic Features; Consistent Improvement; Extraction Module; Prediction Model; Training Set; Low Resolution; Prediction Score; Simulated Datasets; Vanilla; Contextual Cues; Binary Cross-entropy Loss; Masked Images; Speaker Recognition; Speech Rate; Speech Segments; Target Speech; Fine-tuning Strategy; Fine-tuning Stage; Speech Context; Masking Strategy; Speech Utterances; Speech Coding; Pre-training Stage; Validation Set; Visual Encoding; Speaker extraction; self-supervised learning; confidence; multimodal; cocktail party
Abstract: Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues. Although numerous models have been proposed recently, most of them estimate target signals by primarily relying on local dependencies within acoustic features, underutilizing the human-like capacity to infer unclear parts of speech through contextual information. This limitation results in not only suboptimal performance but also inconsistent extraction quality across the utterance, with some segments exhibiting poor quality or inadequate suppression of interfering speakers. To close this gap, we propose a model-agnostic strategy called Mask-And-Recover (MAR). It integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. Additionally, to better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model to assess extraction quality and guide extraction modules to emphasize improvement on low-quality segments. To validate the effectiveness of our proposed model-agnostic training paradigm, six popular AV-TSE backbones were adopted for evaluation on the VoxCeleb2 dataset, demonstrating consistent performance improvements across various metrics.
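The mask-and-recover idea can be pictured roughly as masking random temporal segments of the acoustic features and training a recovery head to reconstruct them. The snippet below is a simplified, unimodal illustration under assumed shapes, not the MAR strategy itself (which also exploits the visual stream and the FCS confidence model).

# Minimal sketch of a mask-and-recover style objective on acoustic features
# (hypothetical; the actual MAR strategy also uses the visual stream).
import torch

def mask_segments(features, num_segments=2, seg_len=10):
    # features: (batch, T, dim); zero out random temporal segments.
    masked = features.clone()
    mask = torch.zeros(features.shape[:2], dtype=torch.bool)
    for b in range(features.shape[0]):
        for _ in range(num_segments):
            start = torch.randint(0, features.shape[1] - seg_len, (1,)).item()
            masked[b, start:start + seg_len] = 0.0
            mask[b, start:start + seg_len] = True
    return masked, mask

feats = torch.randn(4, 100, 256)
masked, mask = mask_segments(feats)
# A recovery head would be trained to predict feats[mask] from the masked input,
# e.g. loss = F.l1_loss(recovered[mask], feats[mask]).
print(masked.shape, mask.float().mean().item())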
Listen, Chat, and Remix: Text-Guided Soundscape Remixing for Enhanced Auditory Experience
Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani
Keywords: Semantics; Speech enhancement; Natural languages; Large language models; Training; Music; Speech; Spectrogram; Data mining; Computational modeling; Language Model; Control Volume; Sound Source; Mixture Of Sources; Instructional Text; Fine-tuned; Transformer; Scaling Factor; Natural Language; Neural Signals; Speech Task; Use Of Text; Expert Model; Cascade System; Extraction Task; Sound Model; Synthetic Mixture; Target Speech; Signal-to-noise Ratio Improvement; Temporal Convolutional Network; Semantic Description; Pre-trained Language Models; Soundscape remixing; sound separation; applications of large language models
Abstract: In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces “Listen, Chat, and Remix” (LCR), a novel multimodal sound remixer that controls each sound source in a mixture based on user-provided text instructions. LCR distinguishes itself with a user-friendly text interface and its unique ability to remix multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for remixing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles the filtered components into the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse remixing tasks, including extraction, removal, and volume control of single or multiple sources. Our experiments demonstrate significant improvements in signal quality across all remixing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources.
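The final remixing step can be thought of as applying per-source gains, derived from the LLM-interpreted instruction, to the decomposed components and summing them back together. The sketch below uses hypothetical source names and gain values and is not the LCR semantic filter itself.

# Minimal sketch of the remixing step: apply per-source gains produced by a
# (text-derived) semantic filter and resynthesize the mixture.
import numpy as np

def remix(sources: dict, gains: dict) -> np.ndarray:
    # sources: name -> waveform (equal lengths); gains: name -> scale factor,
    # e.g. 0.0 removes a source, 1.0 keeps it, 2.0 boosts it.
    length = len(next(iter(sources.values())))
    out = np.zeros(length, dtype=np.float32)
    for name, wav in sources.items():
        out += gains.get(name, 1.0) * wav
    return out

srcs = {"speech": np.random.randn(16000).astype(np.float32),
        "music": np.random.randn(16000).astype(np.float32),
        "traffic": np.random.randn(16000).astype(np.float32)}
# "Remove the traffic noise and turn the music down" could map to:
mixed = remix(srcs, {"speech": 1.0, "music": 0.5, "traffic": 0.0})
print(mixed.shape)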
SAV-SE: Scene-Aware Audio-Visual Speech Enhancement With Selective State Space Model
Xinyuan Qian, Jiaran Gao, Yaodan Zhang, Qiquan Zhang, Hexin Liu, Leibny Paola Garcia Perera, Haizhou Li
Keywords: Speech enhancement; Visualization; Noise; Noise measurement; Transformers; Speech processing; Lips; Training; Data models; Context modeling; State-space Model; Speech Enhancement; Visual Information; Visual Cues; Facial Movements; Competitive Methods; Lip Movements; Rich Contextual Information; Dog Barking; Visual Features; Long Short-term Memory; Visual Scene; Non-negative Matrix Factorization; Speech Signal; Self-supervised Learning; Short-time Fourier Transform; Auditory Signals; Speech Intelligibility; Long-range Dependencies; Temporal Coherence; Clear Speech; Speech Quality; Temporal Convolutional Network; Semantic Consistency; Mean Opinion Score; Noisy Input; 1D Convolution; Noise Recordings; Feed-forward Layer; Transformer; Speech enhancement; audio-visual fusion; state space model
Abstract: Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on facial and lip movements, which can be compromised or entirely inaccessible when occlusions occur or when the camera view is distant. Meanwhile, contextual visual cues from the surrounding environment have been overlooked: for example, when we see a dog bark, our brain has the innate ability to discern and filter out the barking noise. To this end, we introduce a novel task, i.e., Scene-Aware Audio-Visual Speech Enhancement (SAV-SE). To the best of our knowledge, this is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which ultimately improves speech enhancement performance. Specifically, we propose the VC-S$^{2}$E method, which incorporates Conformer and Mamba modules for their complementary strengths. Extensive experiments are conducted on the public MUSIC, AVSpeech, and AudioSet datasets, where the results demonstrate the superiority of VC-S$^{2}$E over other competitive methods.
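One generic way to condition enhancement on scene-level visual context is cross-attention from audio frames to visual features, sketched below. This only illustrates the idea of using scene cues as auxiliary context; it is not the VC-S$^{2}$E Conformer/Mamba design, and the class name SceneConditioning and the dimensions are assumptions.

# Generic sketch of conditioning audio enhancement features on scene-level
# visual context via cross-attention (illustrative only).
import torch
import torch.nn as nn

class SceneConditioning(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, scene_feats):
        # audio_feats: (B, T_audio, dim); scene_feats: (B, T_video, dim)
        ctx, _ = self.attn(query=audio_feats, key=scene_feats, value=scene_feats)
        return self.norm(audio_feats + ctx)   # audio features enriched with scene context

out = SceneConditioning()(torch.randn(2, 200, 256), torch.randn(2, 30, 256))
print(out.shape)  # torch.Size([2, 200, 256])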
Deep Multi-Source Visual Fusion With Transformer Model for Video Content Filtering
Senthil Murugan Nagarajan, Ganesh Gopal Devarajan, Asha Jerlin M, Daniel Arockiam, Ali Kashif Bashir, Maryam M. Al Dabel
Keywords: Feature extraction; Transformers; Filtering; Encoding; Data mining; Accuracy; Computational modeling; Bidirectional control; Web sites; Video on demand; Transformer Model; Video Content; Content Filtering; Text Data; Precision And Recall; Fusion Method; Video Images; Content Categories; Fusion Network; Content Moderation; Audio Content; Weight Matrix; Attention Mechanism; Video Clips; Feed-forward Network; Video Analysis; Feature Matrix; Hidden State; Feature Fusion; Video Data; Audio Data; Video Features; Self-attention Mechanism; Frames Per Second; Transformer Encoder; Sexual Content; Attention Scores; Textual Features; Mel-frequency Cepstral Coefficients; Average Frame; Multi-modal fusion; transformer model; deep learning; content filtering; speech enhancement
Abstract: As YouTube content continues to grow, advanced filtering systems are crucial to ensuring a safe and enjoyable user experience. We present MFusTSVD, a multi-modal model for classifying YouTube video content by analyzing text, audio, and video images. MFusTSVD uses specialized methods to extract features from audio and video images, while processing text data with BERT Transformers. Our key innovation includes two new BERT-based multi-modal fusion methods: B-SMTLMF and B-CMTLRMF. These methods combine features from different data types and improve the model's ability to understand each type of data, including detailed audio patterns, leading to better content classification and speech-related separation. MFusTSVD is designed to perform better than existing models in terms of accuracy, precision, recall, and F-measure. Tests show that MFusTSVD consistently outperforms popular models like Memory Fusion Network, Early Fusion LSTM, Late Fusion LSTM, and multi-modal Transformer across different content types and evaluation measures. In particular, MFusTSVD effectively balances precision and recall, which makes it especially useful for identifying inappropriate speech and audio content, as well as broader categories, ensuring reliable and robust content moderation.
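A bare-bones version of multimodal fusion for content classification is to concatenate clip-level text, audio, and video feature vectors and pass them through a small classifier, as sketched below. This is not the B-SMTLMF or B-CMTLRMF fusion proposed in the paper; the class name SimpleFusionClassifier, the feature dimensions, and the number of classes are hypothetical.

# Minimal sketch of late concatenation of text, audio, and video features
# for content classification (illustrative only).
import torch
import torch.nn as nn

class SimpleFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=40, video_dim=512, num_classes=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + audio_dim + video_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_emb, audio_feat, video_feat):
        # text_emb: e.g. a BERT [CLS] embedding; audio_feat: e.g. mean MFCCs;
        # video_feat: e.g. a pooled frame embedding (all clip-level vectors).
        return self.head(torch.cat([text_emb, audio_feat, video_feat], dim=-1))

logits = SimpleFusionClassifier()(torch.randn(2, 768), torch.randn(2, 40), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 4])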
Enhanced Multimodal Speech Processing for Healthcare Applications: A Deep Fusion Approach
Jianhui Lv, Wadii Boulila, Shalli Rani, Huamao Jiang
Keywords: Medical services; Speech enhancement; Noise; Medical diagnostic imaging; Visualization; Adaptation models; Deep learning; Artificial intelligence; Acoustics; Training; Speech Processing; Healthcare Applications; Loss Function; Acoustic; Healthcare Settings; Telemedicine; Fusion Method; Medical Terms; Multimodal Methods; Medical Communication; Model Performance; Deep Learning; Facial Expressions; Visual Features; Attention Mechanism; Bilinear Interpolation; Healthcare Environment; Self-supervised Learning; Medical Context; Noise Conditions; Lip Movements; Audio Quality; Speech Quality; dB Signal-to-noise Ratio; 3D Convolutional Layers; Medical Scenarios; Temporal Alignment; Wiener Filter; Channel Attention; Medical Quality; Deep learning; multimodal speech enhancement; healthcare audio-visual deep fusion; convolutional neural network
Abstract: Communication in healthcare settings is sometimes affected by ambient noise, resulting in possible misunderstanding of essential information. We introduce the healthcare audio-visual deep fusion (HAV-DF) model, an innovative method that improves speech comprehension in clinical environments by intelligently merging acoustic and visual data. The HAV-DF model has three key advancements. First, it utilizes a medical video interface that collects nuanced visual signals pertinent to medical communication. Second, it employs an advanced multimodal fusion method that adaptively modifies the integration of auditory and visual data in response to noisy situations. Finally, it employs an innovative loss function that integrates healthcare-specific indicators to improve voice optimization for medical applications. Experimental findings on the MedDialog and MedVidQA datasets illustrate the efficacy of the proposed model under diverse noise conditions. In low-SNR situations (−5 dB), HAV-DF attains a PESQ score of 2.45, a 25% improvement over leading approaches. The model achieves a medical term preservation rate of 93.18% under difficult acoustic settings, markedly surpassing current methods. These enhancements provide more dependable communication across many therapeutic contexts, from emergency departments to telemedicine consultations.
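The idea of a healthcare-aware training objective can be sketched as a standard reconstruction loss combined with a weighted domain-specific penalty (for example, from a medical-term recognizer). The snippet below is a simplified illustration under assumed weights, not the HAV-DF loss function.

# Minimal sketch of a combined enhancement objective that adds a
# domain-specific term to a standard reconstruction loss (illustrative only;
# the weight alpha and the auxiliary penalty are hypothetical).
import torch
import torch.nn.functional as F

def combined_loss(enhanced, clean, domain_penalty, alpha=0.8):
    # enhanced, clean: (batch, samples) waveforms;
    # domain_penalty: scalar auxiliary cost, e.g. from a keyword/term recognizer.
    recon = F.l1_loss(enhanced, clean)
    return alpha * recon + (1.0 - alpha) * domain_penalty

loss = combined_loss(torch.randn(2, 16000), torch.randn(2, 16000), torch.tensor(0.3))
print(float(loss))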
Guest Editorial: IEEE JSTSP Special Issue on Deep Multimodal Speech Enhancement and Separation (DEMSES)
Amir Hussain, Yu Tsao, John H.L. Hansen, Naomi Harte, Shinji Watanabe, Isabel Trancoso, Shixiong Zhang
Keywords: Special issues and sections; Speech enhancement; Multisensory integration; Visualization; Context modeling; Speech processing; Data mining; Training; Speech synthesis; Guest Editorial; Speech Enhancement; Multimodal Speech; Transformer; Contextual Information; Visual Cues; State-space Model; Speech Processing; Video Content; Self-supervised Learning; Multimodal Model; Audiovisual Speech
IEEE Signal Processing Society Publication Information