-
Ergodic Imitation With Corrections: Learning From Implicit Information in Human Feedback
Junru Pang, Quentin Anderson-Watson, Kathleen Fitzsimons
Keywords: Trajectory, Force, Deformation, Collision avoidance, Imitation learning, Training, Optimal control, Adaptive systems, Human-robot interaction, Collaborative robots, Human Feedback, Physical Interaction, Performance Metrics, User Study, Commercial Applications, Human-robot Interaction, Robot Behavior, Inverse Reinforcement Learning, Time Interaction, Learning Framework, Aspects Of Behavior, Types Of Errors, Online Learning, Interaction Forces, Joint Angles, Robotic Arm, End-effector, Distribution Of Tasks, Imitation Learning, Task Definition, Online Update, Impedance Control, Robot Trajectory, Online Correction, Task Failure, Test Fixture, Nominal Trajectory, Robot Learning, Negative Aspects, Adaptive system, human–robot interaction, robotics
Abstract: As the prevalence of collaborative robots increases, physical interactions between humans and robots are inevitable, presenting an opportunity for robots not only to maintain safe working parameters around humans but also to learn from these interactions. To develop adaptive robots, we first analyze human responses to different errors through a study in which users are asked to correct any errors the robot makes across various tasks. With this characterization of corrections, we can treat physical human–robot interactions as informative, rather than ignoring them or letting the robot return to its originally planned behavior once the interaction ends. We incorporate physical corrections into existing learning from demonstration (LfD) frameworks, which allow robots to learn new skills by observing human demonstrations. We demonstrate that learning from physical interactions can improve task-specific performance metrics. The results reveal that including information about the behavior being corrected in the update improves task performance significantly compared to adding corrected trajectories alone. In a user study with an optimal control-based LfD framework, we also find that users are able to provide less feedback after each interaction-based update to the robot’s behavior. Utilizing corrections could enable advanced LfD techniques to be integrated into commercial applications for collaborative robots by enabling end-users to customize the robot’s behavior through intuitive interactions rather than by modifying the behavior in software.
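To make the correction-update idea above concrete, here is a minimal sketch (not the authors' implementation) of one way a physical correction could be folded into an LfD target: the states visited while the user guides the robot reinforce a spatial target distribution, while the nominal states that were corrected away are down-weighted. The Gaussian-bump representation, function names, and weighting constant are illustrative assumptions.

```python
"""Hedged sketch: folding a physical correction into an LfD target distribution.

Assumes an ergodic-style spatial target built from weighted trajectory samples;
the weighting scheme and helper names are illustrative, not the authors' code.
"""
import numpy as np

def gaussian_bump(grid, center, sigma=0.05):
    """Isotropic Gaussian evaluated on an (N, d) grid of workspace points."""
    d2 = np.sum((grid - center) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def update_target(grid, phi, corrected_segment, replaced_segment, alpha=0.3):
    """Blend a correction into the current target distribution phi.

    corrected_segment: states visited while the user physically guided the robot
    replaced_segment:  nominal states the correction overrode (treated as negative)
    """
    delta = np.zeros_like(phi)
    for x in corrected_segment:          # reinforce where the user pushed the robot
        delta += gaussian_bump(grid, x)
    for x in replaced_segment:           # suppress the behavior being corrected
        delta -= gaussian_bump(grid, x)
    phi_new = np.clip(phi + alpha * delta, 1e-6, None)
    return phi_new / phi_new.sum()       # keep it a valid spatial distribution

# toy 2-D workspace
xs, ys = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)
phi = np.full(grid.shape[0], 1.0 / grid.shape[0])   # uninformative prior
corrected = np.array([[0.2, 0.8], [0.3, 0.8], [0.4, 0.8]])
replaced = np.array([[0.2, 0.5], [0.3, 0.5], [0.4, 0.5]])
phi = update_target(grid, phi, corrected, replaced)
```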
-
Understanding and Predicting Temporal Visual Attention Influenced by Dynamic Highlights in Monitoring Task
Zekun Wu, Anna Maria Feit
Keywords: Visualization, Monitoring, Drones, Predictive models, Real-time systems, Image color analysis, Graphical user interfaces, Timing, Visual Attention, Monitoring Task, Dynamic Changes, Temporal Features, Situational Awareness, Gaze Behavior, Dynamic Interface, User Attention, Saliency Models, Prediction Accuracy, Convolutional Neural Network, Spatial Features, Fixed Point, Visual Cues, Eye-tracking, Temporal Information, Spatial Processing, Attentional Orienting, Critical Situations, Saliency Map, Gaze Data, Fixation Duration, Fixation Count, Temporal Prediction, Relative Salience, Bottom-up Model, Transformer Encoder, Engagement Metrics, Battery Level, Dynamic highlight, gaze behavior analysis, visual attention, visual saliency
Abstract: Monitoring interfaces are crucial for dynamic, high-stakes tasks where effective allocation of user attention is essential. Visual highlights can guide attention effectively but may also introduce unintended disruptions. To investigate this, we examined how visual highlights affect users’ gaze behavior in a drone monitoring task, focusing on when, for how long, and how much attention they draw. We found that highlighted areas exhibit distinct temporal characteristics compared to nonhighlighted ones, quantified using normalized saliency (NS) metrics. Highlights elicited immediate responses, with NS peaking quickly, but this shift came at the cost of reduced search effort elsewhere, potentially impacting situational awareness. To predict these dynamic changes and support interface design, we developed the Highlight-Informed Saliency Model, which provides granular predictions of NS over time. These predictions enable evaluations of highlight effectiveness and inform the optimal timing and deployment of highlights in future monitoring interface designs, particularly for time-sensitive tasks.
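The abstract does not define the normalized saliency (NS) metric in detail; the sketch below assumes it behaves like the standard normalized scanpath saliency score, i.e., a z-normalized attention map averaged at fixation locations, computed per time bin after a highlight onset. The bin width, map source, and function names are assumptions.

```python
"""Hedged sketch of a normalized-saliency (NS) style metric over time.

Assumes NS resembles the standard NSS score: z-normalize an attention map, then
average it at fixation locations falling inside each time bin. The paper's exact
definition may differ.
"""
import numpy as np

def ns_score(saliency_map, fixations):
    """saliency_map: (H, W) array; fixations: list of (row, col) pixel indices."""
    z = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return float(np.mean([z[r, c] for r, c in fixations])) if fixations else np.nan

def ns_over_time(saliency_map, gaze, t_on, window=0.25, n_bins=8):
    """NS in consecutive bins after a highlight onset t_on; gaze: list of (t, row, col)."""
    scores = []
    for k in range(n_bins):
        lo, hi = t_on + k * window, t_on + (k + 1) * window
        fix = [(r, c) for t, r, c in gaze if lo <= t < hi]
        scores.append(ns_score(saliency_map, fix))
    return scores

# toy example: a bright highlight region and fixations shortly after onset
sal = np.zeros((60, 80)); sal[20:30, 40:55] = 1.0
gaze = [(0.10, 25, 45), (0.30, 50, 10), (0.60, 22, 50)]
print(ns_over_time(sal, gaze, t_on=0.0, window=0.25, n_bins=3))
```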
-
2025 Index IEEE Transactions on Human-Machine Systems
-
SHA-SCP: A UI Element Spatial Hierarchy Aware Smartphone User Click Behavior Prediction Method
Ling Chen, Qian Chen, Yiyi Peng, Kai Qian, Hongyu Shi, Xiaofan Zhang
Keywords: Data collection, Predictive models, Attention mechanisms, Accuracy, User interfaces, Transformers, Image color analysis, Data privacy, Complexity theory, Smart phones, Predictor Of Behavior, Interface Elements, User Clicks, Click Behavior, Behavior Prediction Method, Attention Mechanism, User Behavior, Top-1 Accuracy, Improve User Experience, Top-5 Accuracy, Contralateral, Training Set, Transformer, Support Vector Machine, Contextual Information, Training Time, Data Privacy, Recurrent Neural Network, Multilayer Perceptron, Representative Element, Number Of Heads, Feed-forward Network, Position Embedding, Traditional Machine Learning Methods, Textural Properties, Spatial Awareness, Sequence Elements, Group Elements, Human-computer Interaction, Click behavior prediction, spatial hierarchy awareness, user interface
Abstract: Predicting user click behavior and making relevant recommendations based on the user’s historical clicks are critical to simplifying operations and improving user experience. Modeling user interface (UI) elements is essential to click behavior prediction, yet the complexity and variety of UIs make it difficult to adequately capture information at different scales. In addition, the lack of relevant datasets presents further difficulties for such studies. In response to these challenges, we construct a fine-grained smartphone usage behavior dataset containing 3 664 325 clicks from 100 users and propose a UI element Spatial Hierarchy Aware Smartphone user Click behavior Prediction method (SHA-SCP). SHA-SCP builds element groups by clustering elements according to their spatial positions and uses attention mechanisms to perceive the UI at both the element level and the element-group level, fully capturing information at different scales. Experiments on this dataset show that our method outperforms the best baseline by an average of 18.35%, 13.86%, and 11.97% in Top-1, Top-3, and Top-5 Accuracy, respectively.
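As a rough illustration of the element-grouping step described above, the sketch below clusters UI element bounding-box centers into spatial groups; DBSCAN, the distance threshold, and the toy screen layout are illustrative choices rather than the paper's actual procedure.

```python
"""Hedged sketch of the spatial element-grouping idea behind SHA-SCP.

Assumes element groups are formed by clustering bounding-box centers; DBSCAN and
the parameter values below are illustrative, not the authors' choices.
"""
import numpy as np
from sklearn.cluster import DBSCAN

def group_elements(bboxes, eps=0.35, min_samples=1):
    """bboxes: (N, 4) array of [x1, y1, x2, y2] in screen-normalized coordinates.
    Returns one integer group id per element."""
    centers = np.stack([(bboxes[:, 0] + bboxes[:, 2]) / 2,
                        (bboxes[:, 1] + bboxes[:, 3]) / 2], axis=1)
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centers)

# elements of a toy screen: a top bar and a bottom button row
bboxes = np.array([[0.05, 0.02, 0.30, 0.08],
                   [0.35, 0.02, 0.60, 0.08],
                   [0.10, 0.90, 0.30, 0.97],
                   [0.40, 0.90, 0.60, 0.97]])
print(group_elements(bboxes))   # -> [0 0 1 1]: top-bar elements vs. bottom-row buttons
```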
-
A Human–Machine Cooperative Control Strategy Based on Deep Reinforcement Learning to Enhance Heavy Vehicle Driving Safety
Han Zhang, Yuhan Liu, Liaoyang Zhan, Wanzhong Zhao
Keywords: Stability criteria, Safety, Control systems, Wheels, Steering systems, Brakes, Human-machine systems, Cooperative systems, Advanced driver assistance systems, Algorithm design and analysis, Multi-agent systems, Control Strategy, Road Safety, Deep Reinforcement Learning, Cooperative Control, Cooperative Control Strategy, Control System, Optimal Control, Control Sequence, Model Predictive Control, Multi-agent Systems, Planning Phase, Pareto Optimal, Reward Function, Vehicle Safety, Advanced Driver Assistance Systems, Driver Characteristics, Actor Network, Intelligence Technology, Fuzzy Control, Yaw Rate, Sideslip Angle, Yaw Moment, Steering Angle, Tyre Model, Vehicle Stability, Types Of Drivers, Front Wheel, Vertical Load, Vehicle Dynamics, Active front steering (AFS), differential braking control (DBC), deep deterministic policy gradient (DDPG), human–machine cooperative control, Pareto optimality theory
Abstract: As heavy vehicles advance toward increased intelligence and modernization, the control of advanced driver assistance systems for ensuring driving safety faces significant challenges. To enhance the driving safety of heavy vehicles operated by drivers with varying driving styles, this article proposes a human–machine cooperative control (HMCC) strategy that combines steering and braking using the deep deterministic policy gradient (DDPG) algorithm. First, a multiagent system is adopted as the framework for the driving safety assistance control system, in which the active front steering (AFS) system and the differential braking control (DBC) system function as subsystems. These subsystems interact through control sequence information while managing yaw and roll stability, and the optimal control performance of both AFS and DBC is ensured using a distributed model predictive controller and Pareto optimality theory. Second, to analyze different driving styles, safety characteristic parameters were collected from multiple drivers; by analyzing the drivers’ effects on yaw and roll stability, they were classified into three types. An HMCC strategy based on DDPG is then designed: phase-plane constraints that consider yaw and roll stability are incorporated into the DDPG reward function, training the agents to allocate cooperative control weights between the driver and the AFS and DBC controllers. Finally, the effectiveness of the proposed control strategy is validated through an electro-hydraulic compound steering and braking hardware-in-the-loop test system, demonstrating its ability to improve driving safety for different driver characteristics.
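The sketch below illustrates, under stated assumptions, the two ingredients named in the abstract: a reward that penalizes leaving a sideslip phase-plane and roll envelope, and a cooperative blending of driver and controller steering commands by a learned authority weight. The coefficients, penalty shapes, and blending rule are placeholders, not the paper's design.

```python
"""Hedged sketch of the phase-plane reward and cooperative-weighting ideas."""
import numpy as np

def phase_plane_penalty(beta, beta_dot, c1=9.55, c2=2.49):
    """Penalty for leaving a |c1*beta + c2*beta_dot| <= 1 style sideslip region
    (coefficients illustrative). Zero inside the stable region, growing outside."""
    margin = abs(c1 * beta + c2 * beta_dot)
    return max(0.0, margin - 1.0)

def reward(beta, beta_dot, roll_angle, roll_limit=np.deg2rad(6.0),
           w_yaw=1.0, w_roll=1.0):
    """Reward stays near zero while sideslip and roll remain inside their envelopes."""
    r_yaw = -w_yaw * phase_plane_penalty(beta, beta_dot)
    r_roll = -w_roll * max(0.0, abs(roll_angle) - roll_limit)
    return r_yaw + r_roll

def blend_steering(driver_angle, controller_angle, authority):
    """Cooperative control: authority in [0, 1] is the weight given to the driver."""
    authority = float(np.clip(authority, 0.0, 1.0))
    return authority * driver_angle + (1.0 - authority) * controller_angle

# example: the agent's output (authority) shifts control toward the AFS controller
print(blend_steering(driver_angle=0.10, controller_angle=0.04, authority=0.3))
```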
-
Emergency Motor Intention Detection Based on Unpredictable Anticipatory Activity: An EEG Study
Long Chen, Jiatong He, Lei Zhang, Minpeng Xu, Zhongpeng Wang, Dong Ming
Keywords: Electroencephalography, Decoding, Human-machine systems, Time factors, Electrodes, Brain-computer interfaces, Virtual reality, Visual perception, Somatosensory, Motor Intention, Behavioral Responses, Neural Activity, Classification Performance, Linear Discriminant Analysis, Temporal Features, Cortical Activity, Visual Observation, Real-world Scenarios, State Of Emergency, Cognitive Evaluation, Chance Level, Temporal Domain, Cascade Process, Motor Preparation, Readiness Potential, Common Spatial Pattern, System Response Time, Event-related Spectral Perturbation, Brain-computer Interface Technology, Immersive Virtual Reality, Theta Band, Motor Imagery, Alpha Band, Motor Execution, EEG Signals, Immersive Environment, Premotor Cortex, P300 Amplitude, Canonical Correlation Analysis, Brain–computer interface (BCI), electroencephalography (EEG), emergency anticipation, event-related potential, motor intention, movement-related cortical potential, virtual reality
Abstract: Objective: Emergency anticipation (EA) refers to the brain’s rapid perceptual, cognitive, and motor preparation in response to imminent emergencies. Timely decoding of EA can facilitate proactive responses before full behavioral execution, which is critical in real-world scenarios such as avoiding hazards or mitigating accidents. However, the cortical activation underlying the EA process has not been fully explored. This study aims to analyze the neural activity of the EA process and explore the feasibility of detecting emergency motor intention in conjunction with brain-computer interface (BCI) technology. Methods: We designed a new emergency state induction paradigm in a virtual environment, including a target task (emergency anticipation, EA) and two baseline tasks (emergency anticipation execution, EAE, and visual observation, VO). A total of 31 healthy subjects were recruited for the offline experiment. The cortical responses during the EA process were quantified by analyzing event-related potentials, movement-related cortical potentials, and event-related spectral perturbation. Discriminative canonical pattern matching, common spatial patterns, and shrinkage linear discriminant analysis were employed to perform binary classification. Six subjects participated in a pseudo-online asynchronous experiment to validate the feasibility of identifying emergency motor intention. Results: The results showed that the cascading process associated with EA exists in both the temporal and spectral domains. In particular, the temporal-domain features demonstrated superior classification performance, with an average accuracy of 90.13% (>80% chance level). The pseudo-online evaluation showed a system response time averaging 257.12 ms, 35 ms faster than the behavioral response. Significance: Our work demonstrated the cascading process of perceptual recognition, cognitive evaluation, and motor preparation during EA and provided preliminary evidence supporting the feasibility of detecting emergency motor intentions. These findings lay a theoretical foundation for extending the application of BCI technology to rapid control scenarios.
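Two of the classifiers named above, common spatial patterns followed by shrinkage linear discriminant analysis, form a standard pipeline; the sketch below shows a minimal version using MNE and scikit-learn on synthetic epochs. The epoch shape, filter count, and data are assumptions, and the discriminative canonical pattern matching step is omitted.

```python
"""Hedged sketch of a CSP + shrinkage-LDA pipeline like the one named in the abstract.

Epoch shapes and the synthetic data are assumptions. Requires mne and scikit-learn.
"""
import numpy as np
from mne.decoding import CSP
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# hypothetical epochs: 120 trials x 32 channels x 250 samples (1 s at 250 Hz)
X = rng.standard_normal((120, 32, 250))
y = rng.integers(0, 2, size=120)        # 0 = baseline task, 1 = emergency anticipation

clf = make_pipeline(
    CSP(n_components=6, reg="ledoit_wolf", log=True),             # spatial filtering
    LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto"),   # shrinkage LDA
)
scores = cross_val_score(clf, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```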
-
Automotive Cockpit-Driving Integration for Human-Centric Autonomous Driving: A Survey
Zhongpan Zhu, Shuaijie Zhao, Mobing Cai, Cheng Wang, Aimin Du
Keywords: Decision making, Driver behavior, Safety, Autonomous vehicles, Prediction algorithms, Surveys, Neural networks, Human vehicle systems, Algorithm design and analysis, Autonomous Vehicles, Neural Network, Lifelong Learning, Driver Behavior, Personality, Convolutional Neural Network, Support Vector Machine, Traffic Congestion, Levels Of Fatigue, Long Short-term Memory Network, Road Conditions, Adaptive Threshold, Vehicle Dynamics, Vehicle State, Vehicle Trajectory, Lane Change, Driver State, Lane Markings, Catastrophic Forgetting, Driver Characteristics, Driver Fatigue, Relevance Vector Machine, Vulnerable Road Users, Uncertain Behavior, Road Safety, Memory Pool, Multivariate Time Series, Mathematical Model, Long Short-term Memory, Graph Neural Networks, Autonomous vehicles, cockpit-driving integration (CDI), human–vehicle interaction
Abstract: Intelligent driving aims to handle dynamic driving tasks in complex environments, while the behavior of the driver onboard receives far less attention. In contrast, an intelligent cockpit focuses mainly on interacting with the driver, with limited connection to the driving scenario. Since the driver onboard can significantly affect the driving strategy and thus has nonnegligible safety implications for an autonomous vehicle, cockpit-driving integration (CDI) is essential to take the driver’s behavior and intention into account when shaping the driving strategy. However, despite the significant role of CDI in safe driving, no comprehensive review of existing CDI technologies has been conducted. We are therefore motivated to summarize the state of the art in CDI methods and investigate their development trends. To this end, we thoroughly identify current applications of CDI for the perception and decision-making of autonomous vehicles and highlight critical issues that urgently need to be addressed. Additionally, we propose a lifelong learning framework based on evolvable neural networks as a solution for future CDI. Finally, challenges and future work are discussed. This work provides useful insights for developers designing safe, human-centric autonomous vehicles.
-
EEG Neurofeedback-Based Gait Motor Imagery Training in Lokomat Enhances Motor Rhythms in Complete Spinal Cord Injury
Ericka R. da Silva Serafini, Cristian D. Guerrero-Mendez, Douglas M. Dunga, Teodiano F. Bastos-Filho, Anibal Cotrina Atencio, André F. O. de Azevedo Dantas, Caroline C. do Espírito Santo, Denis Delisle-Rodriguez
Keywords: Training, Electroencephalography, Modulation, Legged locomotion, Robot sensing systems, Visualization, Neurofeedback, Spinal cord injury, Spinal Cord Injury, Motor Imagery, Complete Spinal Cord Injury, Motor Rhythm, Individual Strains, Visual Feedback, Min Walk, Gait Training, Cortical Patterns, Motor Training, Cortical Reorganization, Weight Support, Physical Disability, Cohen’s Kappa, Linear Discriminant Analysis, Baseline Condition, Individual Functions, Cortical Activity, EEG Data, EEG Signals, American Spinal Injury Association, Cortical Modulation, Linear Discriminant Analysis Classifier, Common Spatial Pattern, Motor Learning, EEG Patterns, Lower Limb Movements, Motor Imagery Tasks, Cortical Responses, Severe Motor Impairment, Electroencephalography (EEG), gait training, locomotion, neurofeedback (NFB), spinal cord injury (SCI)
Abstract: Robotic interventions combining neurofeedback (NFB) and motor imagery (MI) are emerging strategies to promote cortical reorganization and functional training in individuals with complete spinal cord injury (SCI). This study proposes an electroencephalogram-based NFB approach for MI training, designed to teach MI-related brain rhythm modulation in the Lokomat. For the purposes of this study, NFB is defined as a visual feedback training scheme. The proposed system introduces a formulation to minimize the default cortical effects that the Lokomat produces on the individual’s activity during passive walking. Two individuals with complete SCI tested the proposed NFB system in order to relearn the modulation of Mu ($\mu$: 8–12 Hz) and Beta ($\beta$: 13–30 Hz) rhythms over Cz while receiving gait training with full weight support across 12 sessions. Each session consisted of three stages: 1) 2 min of walking without MI (baseline); 2) 5 min of walking with MI and True NFB; and 3) 5 min of walking with MI and Sham NFB. The latter two stages were randomized session by session. The findings suggest that the proposed NFB approach may promote cortical reorganization and support the restoration of sensorimotor functions. Significant differences were observed between cortical patterns during True NFB and Sham NFB, particularly in the last intervention sessions. These results confirm the positive impact of the NFB system on gait motor training by enabling individuals with complete SCI to learn how to modulate their motor rhythms in specific cortical areas.
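As a rough sketch of how a visual feedback value over Cz might be derived, the code below computes Welch band power in the Mu (8–12 Hz) and Beta (13–30 Hz) bands and expresses it relative to a no-MI walking baseline. The normalization, window lengths, and synthetic signals are assumptions; the paper's formulation for suppressing Lokomat-induced activity is more involved.

```python
"""Hedged sketch of deriving a Mu/Beta feedback value from Cz during Lokomat walking."""
import numpy as np
from scipy.signal import welch

FS = 250  # Hz, assumed sampling rate

def band_power(x, fs, lo, hi):
    """Integrated Welch PSD of a 1-D EEG window inside [lo, hi] Hz."""
    f, pxx = welch(x, fs=fs, nperseg=min(len(x), 2 * fs))
    mask = (f >= lo) & (f <= hi)
    return float(np.trapz(pxx[mask], f[mask]))

def feedback_value(cz_window, cz_baseline):
    """ERD-style modulation: negative values mean Mu/Beta suppression vs. baseline,
    which the visual feedback bar would reward during MI."""
    mu = band_power(cz_window, FS, 8, 12) / band_power(cz_baseline, FS, 8, 12) - 1.0
    beta = band_power(cz_window, FS, 13, 30) / band_power(cz_baseline, FS, 13, 30) - 1.0
    return 0.5 * (mu + beta)

# toy example with synthetic signals
rng = np.random.default_rng(1)
baseline = rng.standard_normal(2 * FS)       # 2 s of walking without MI
window = 0.7 * rng.standard_normal(2 * FS)   # weaker rhythms during MI
print(feedback_value(window, baseline))      # negative -> suppression vs. baseline
```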
-
ST-GCN-AltFormer: Gesture Recognition With Spatial-Temporal Alternating Transformer
Qing Pan, Jintao Zhu, Lingwei Zhang, Gangmin Ning, Luping Fang
Keywords: Transformers, Hands, Gesture recognition, Feature extraction, Joints, Skeleton, Data mining, Accuracy, Graph convolutional networks, Spatiotemporal phenomena, Gesture Recognition, Convolutional Network, Spatial Information, Temporal Information, Nodes In The Graph, Graph Convolutional Network, Hand Gestures, Graph Convolution, Long-range Dependencies, Transformer Architecture, Spatial-temporal Information, Hand Gesture Recognition, Spatial Features, Spatial Dimensions, Temporal Features, Index Finger, Multilayer Perceptron, Convolution Operation, Spatial Domain, Temporal Domain, Finger Joints, Spatial-temporal Features, Temporal Convolution, Adjacent Nodes, Adjacent Joints, Hand Shape, Skeleton Data, Human Activity Recognition, Self-attention Mechanism, State Of The Art Methods, Alternating transformer, graph convolutional network (GCN), hand gesture recognition, spatial-temporal attention
Abstract: In skeleton-based gesture recognition tasks, existing approaches based on graph convolutional networks (GCNs) struggle to capture the synergistic actions of nonadjacent graph nodes and the information conveyed by their long-range dependencies. Combining spatial and temporal transformers is a promising way to address this limitation, given the transformer’s strength in modeling nonadjacent long-range dependencies, but an effective strategy for integrating the spatial and temporal information extracted by these transformers has been lacking. This article therefore proposes the spatial-temporal alternating graph convolution transformer (ST-GCN-AltFormer), which connects the spatial-temporal graph convolutional network (ST-GCN) with a spatial-temporal alternating transformer (AltFormer) architecture. In the AltFormer architecture, the spatial-temporal transformer branch employs a spatial transformer to capture information within specific frames and a temporal transformer to analyze its evolution over the entire temporal range. Meanwhile, the temporal-spatial transformer branch extracts temporal information from specific nodes using a temporal transformer and integrates it with a spatial transformer. This fusion enables accurate spatial-temporal information extraction. Our method outperforms state-of-the-art methods, achieving accuracies of 97.5%, 95.8%, 94.3%, 92.8%, and 98.31% on the large-scale 3D hand gesture recognition (SHREC’17 Track), Dynamic Hand Gesture 14-28 (DHG-14/28), and leap motion dynamic hand gesture (LMDHG) datasets, respectively.
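A minimal PyTorch sketch of the alternating idea follows: self-attention is applied across joints within each frame and then across frames for each joint. The dimensions and the single-branch structure are illustrative; the full model also uses graph convolutions, a temporal-first branch, and branch fusion.

```python
"""Hedged sketch of spatial-then-temporal alternating self-attention over skeletons."""
import torch
import torch.nn as nn

class SpatialTemporalAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, frames, joints, dim)
        b, t, j, d = x.shape
        s = x.reshape(b * t, j, d)                       # attend across joints per frame
        s, _ = self.spatial(s, s, s)
        s = s.reshape(b, t, j, d)
        u = s.permute(0, 2, 1, 3).reshape(b * j, t, d)   # attend across frames per joint
        u, _ = self.temporal(u, u, u)
        return u.reshape(b, j, t, d).permute(0, 2, 1, 3)

x = torch.randn(2, 30, 22, 64)   # 2 clips, 30 frames, 22 hand joints, 64-dim features
print(SpatialTemporalAttention()(x).shape)   # torch.Size([2, 30, 22, 64])
```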
-
AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos
Sahibzada Adil Shahzad, Ammarah Hashmi, Yan-Tsung Peng, Yu Tsao, Hsin-Min Wang
Keywords: Deepfakes, Visualization, Feature extraction, Forgery, Lips, Detectors, Forensics, Faces, Streaming media, Social networking (online), Audio-visual systems, Deepfake Detection, Convolutional Network, Visual Features, Facial Features, Visual Modality, Acoustic Features, Fake News, Multimodal Model, Multimedia Content, Temporal Convolution, Temporal Convolutional Network, Multimodal Detection, Lip Region, Speech Recognition, Detection Model, Feed-forward Network, Ensemble Model, Linear Classifier, Inference Time, Audiovisual Speech, Feature Fusion Module, Deep Learning-based Methods, Audio Information, Lip Movements, Real Videos, Visual Frame, Increasing Model Complexity, Self-supervised Learning, Video Content, Audio-visual, audio-visual deepfake detection, deepfake detection, deepfakes, inconsistency, lip sync, multimedia forensics, multimodality, video forgery
Abstract: Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. To avoid the spread of false propaganda and fake news, timely detection is crucial. Damage to either modality (i.e., visual or audio) can only be discovered through multimodal models that exploit both streams of information simultaneously. However, previous methods mainly adopt unimodal video forensics and use supervised pretraining for forgery detection. This study proposes a new method based on a multimodal self-supervised learning (SSL) feature extractor that exploits the inconsistency between audio and visual modalities for multimodal video forgery detection. We use the transformer-based SSL-pretrained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor and a multiscale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT extracts visual features only from the lip region, we also adopt another transformer-based video model to exploit facial features and capture spatial and temporal artifacts introduced during the deepfake generation process. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
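The sketch below illustrates the downstream stage described above under simplifying assumptions: precomputed, temporally aligned audio and visual feature sequences (e.g., from a pretrained AV-HuBERT encoder) are concatenated and passed through a small multiscale temporal convolution head before a real/fake classifier. The feature dimension, kernel sizes, and fusion-by-concatenation are assumptions, and the extra facial branch is not reproduced.

```python
"""Hedged sketch of fusing precomputed audio and visual feature sequences for
real/fake classification; not the paper's MS-TCN head."""
import torch
import torch.nn as nn

class MultiscaleTemporalHead(nn.Module):
    def __init__(self, feat_dim=768, hidden=256, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(2 * feat_dim, hidden, k, padding=k // 2) for k in kernels
        ])
        self.classifier = nn.Linear(hidden * len(kernels), 2)   # real vs. fake logits

    def forward(self, audio_feats, visual_feats):
        # both: (batch, time, feat_dim), assumed temporally aligned
        x = torch.cat([audio_feats, visual_feats], dim=-1).transpose(1, 2)  # (B, 2D, T)
        pooled = [branch(x).relu().mean(dim=-1) for branch in self.branches]
        return self.classifier(torch.cat(pooled, dim=-1))                   # (B, 2)

a = torch.randn(4, 50, 768)   # hypothetical AV-HuBERT audio features
v = torch.randn(4, 50, 768)   # hypothetical AV-HuBERT visual features
print(MultiscaleTemporalHead()(a, v).shape)   # torch.Size([4, 2])
```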