IEEE Transactions on Pattern Analysis and Machine Intelligence

Learning Dynamic Graph Embeddings With Neural Controlled Differential Equations
Tiexin Qin, Benjamin Walker, Terry Lyons, Hong Yan, Haoliang Li
Keywords: Differential equations, Mathematical models, Representation learning, Vectors, Predictive models, Trajectory, Graph neural networks, Topology, Message passing, Robustness, Differential Equations, Dynamic Graph, Neural Network, Dynamic Model, Deep Neural Network, Temporal Evolution, Representation Learning, Vector Field, Graph Structure, Dynamic Representation, Linear Graph, Node Embeddings, Prediction Error, Structural Dynamics, High Error, Mean Absolute Error, Learnable Parameters, Prediction Task, Random Networks, Time Stamp, Graph Neural Networks, Node Attributes, Static Graph, Cubic Interpolation, Node Representations, Temporal Graph, Heat Diffusion, Intrinsic Dynamics, Temporal Prediction, Graph Topology, Dynamic graph, embedding learning, graph neural network, controlled differential equations
Abstract: This paper focuses on representation learning for dynamic graphs with temporal interactions. A fundamental issue is that both the graph structure and the nodes have their own dynamics, and their blending induces intractable complexity in the temporal evolution over graphs. Drawing inspiration from the recent progress of physical dynamic models in deep neural networks, we propose Graph Neural Controlled Differential Equations (GN-CDEs), a continuous-time framework that jointly models node embeddings and structural dynamics by incorporating a graph-enhanced neural network vector field with a time-varying graph path as the control signal. Our framework exhibits several desirable characteristics, including the ability to express dynamics on evolving graphs without piecewise integration, the capability to calibrate trajectories with subsequent data, and robustness to missing observations. Empirical evaluation on a range of dynamic graph representation learning tasks demonstrates the effectiveness of our proposed approach in capturing the complex dynamics of dynamic graphs.
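To make the controlled-differential-equation formulation concrete, here is a minimal sketch, assuming an Euler discretization, a single message-passing step, and toy shapes (names and structure are ours, not the authors' released code): the hidden node state is driven by increments of a time-varying node-feature control path.

```python
import torch
import torch.nn as nn

class GraphVectorField(nn.Module):
    """Maps hidden state z to a matrix field f(z) that multiplies dX (CDE form)."""
    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        self.hidden_dim, self.input_dim = hidden_dim, input_dim
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 64), nn.Tanh(),
            nn.Linear(64, hidden_dim * input_dim),
        )

    def forward(self, adj, z):
        z = adj @ z  # one message-passing step over the graph
        return self.net(z).view(-1, self.hidden_dim, self.input_dim)

def cde_solve(field, adj, z0, x_path):
    """Euler scheme: z <- z + f(z) dX, with dX the control-path increment."""
    z = z0
    for t in range(x_path.size(0) - 1):
        dx = x_path[t + 1] - x_path[t]                        # (N, d)
        z = z + torch.einsum('nhd,nd->nh', field(adj, z), dx)
    return z

# Toy usage: 5 nodes, 3-dim features observed at 10 timestamps.
N, d, h = 5, 3, 8
z = cde_solve(GraphVectorField(h, d), torch.eye(N),
              torch.zeros(N, h), torch.randn(10, N, d))
```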
RealCustom++: Representing Images as Real Textual Word for Real-Time Customization
Zhendong Mao, Mengqi Huang, Fei Ding, Mingcong Liu, Qian He, Yongdong Zhang
Keywords: Controllability, Training, Semantics, Overfitting, Visualization, Toy manufacturing industry, Text to image, Navigation, Cross layer design, Adaptive systems, Text Words, Real Words, Visual Conditions, Pseudowords, Image Features, Image Generation, Reference Image, Deep Features, Early Stopping, Hidden State, Multi-scale Features, Multiple Subjects, Subjective Image, Representative Subject, Masked Images, Training Framework, Shallow Features, Diverse Tasks, Regularization Loss, Semantic Change, Conference Version, Image Encoder, Pre-trained Encoder, Text Encoder, Denoising, Empirical Validation, Low-resolution Feature, Training Step, Attention Map, Fine-grained Features, Text-to-image customization, diffusion models, image generation, curriculum learning
Abstract: Text-to-image customization aims to generate images that align with both the given text and the subject in the given image. Existing works follow the pseudo-word paradigm, which represents the subject as a non-existent pseudo word and combines it with other text to generate images. However, the pseudo word inherently conflicts and entangles with other real words, resulting in a dual-optimum paradox between subject similarity and text controllability. To address this, we propose RealCustom++, a novel real-word paradigm that represents the subject with a non-conflicting real word to generate a coherent guidance image and corresponding subject mask, thereby disentangling the influence scopes of the text and subject for simultaneous optimization. Specifically, RealCustom++ introduces a train-inference decoupled framework: (1) during training, it learns a general alignment between visual conditions and all real text words; and (2) during inference, a dual-branch architecture is employed, where the Guidance Branch produces the subject guidance mask, and the Generation Branch utilizes this mask to customize the generation of the specific real word exclusively within subject-relevant regions. Extensive experiments validate RealCustom++'s superior performance, which improves controllability by 7.48%, similarity by 3.04%, and quality by 76.43% simultaneously. Moreover, RealCustom++ further improves controllability by 4.6% and multi-subject similarity by 6.34% for multi-subject customization.
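The mask-scoped disentanglement can be caricatured with a toy feature blend (a sketch under our own assumptions, not the RealCustom++ implementation): subject features influence generation only inside the guidance mask, while text features govern everything else.

```python
import torch

def masked_subject_injection(text_feat, subject_feat, mask):
    """mask in [0, 1] with shape (H, W, 1); features have shape (H, W, C)."""
    # Inside the subject mask the reference-image features dominate;
    # outside it the text-conditioned features are left untouched.
    return mask * subject_feat + (1.0 - mask) * text_feat

h, w, c = 4, 4, 8
out = masked_subject_injection(torch.randn(h, w, c), torch.randn(h, w, c),
                               (torch.rand(h, w, 1) > 0.5).float())
```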
Exploiting the Benefits of Temporal Information in the Realm of LiDAR Panoptic Segmentation
Ngoc-Quan Ha-Phan, Myungsik Yoo
Keywords: Laser radar, Point cloud compression, Accuracy, Three-dimensional displays, Semantics, Semantic segmentation, Predictive models, Productivity, Object detection, Distance measurement, Temporal Information, Light Detection And Ranging, Panoptic Segmentation, Point Cloud, Final Prediction, Autonomous Vehicles, Final Segmentation, Segmentation Prediction, Pedestrian, Convolution Operation, Semantic Segmentation, Feature Points, Fusion Method, Segmentation Performance, Multiple Frames, Attention Map, Instance Segmentation, 3D Segmentation, Advanced Driver Assistance Systems, 3D Object Detection, Grid Features, Recognition Quality, Semantic Prediction, Offset Vector, Benchmark Evaluation, Backbone Feature, Target Frame, Unsupervised Clustering Method, Promising Performance, LiDAR, LiDAR panoptic segmentation (LPS), multi-frame processing, temporal data, vehicle perception
Abstract: LiDAR perception for autonomous driving applications offers highly accurate scene depiction in three-dimensional (3D) space; its most representative task is LiDAR panoptic segmentation (LPS), which provides both instance- and semantic-level segmentation in a holistic manner. Although previous approaches have achieved mature performance, no research has explored temporal information for enhancing LPS performance. Since multi-frame processing has been proven to yield better predictions in other LiDAR perception challenges, in terms of both feature representation and recursive forecasting, this study proposes an effective, temporal-aware panoptic segmentation method for LiDAR point clouds. Specifically, we introduce two modules: a convolution-based cross-frame fusion attention (CFFA) module and an adjacent shifted feature encoder (ASFE) module. The CFFA module fuses multi-frame features on the basis of convolution-based attention, whereas the ASFE module leverages adjacent model outputs and serves as an intermediate guide for final segmentation predictions. Extensive experiments confirm the effectiveness of both modules for LPS. The proposed LPS model achieves impressive panoptic-quality scores on popular benchmarks (63.36% on SemanticKITTI and 78.54% on Panoptic nuScenes), outperforming previous state-of-the-art methods by a significant margin. Further quantitative and qualitative analyses evidence the advantages of multi-frame processing for LPS and demonstrate its behavior under different settings.
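As a rough illustration of convolution-based cross-frame attention (our sketch; module name and shapes are assumptions, not the paper's code), the previous frame's grid features can be gated by convolutional attention weights before being fused into the target frame:

```python
import torch
import torch.nn as nn

class CrossFrameFusion(nn.Module):
    """Toy conv-attention fusion of a previous frame into the current one."""
    def __init__(self, channels):
        super().__init__()
        self.to_attn = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, cur, prev):
        # Attention weights computed from both frames gate the
        # previous-frame contribution before adding it to the target frame.
        attn = torch.sigmoid(self.to_attn(torch.cat([cur, prev], dim=1)))
        return cur + attn * prev

fuse = CrossFrameFusion(16)
out = fuse(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32))
```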
High-Resolution Open-Vocabulary Object 6D Pose Estimation
Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Changjae Oh, Andrea Cavallaro, Fabio Poiesi
Keywords: Pose estimation, Three-dimensional displays, Solid modeling, Feature extraction, Detectors, Video sequences, Translation, Cameras, Training, Object recognition, Pose Estimation, 6D Pose, 6D Pose Estimation, 6D Object Pose, Multi-scale Features, Average Recall, High-resolution Features, Human Pose Estimation, Objective Estimates, Relative Pose, Unseen Objects, Natural Language Descriptions, Feature Space, Feature Maps, Point Cloud, Bounding Box, Image Pairs, Surgical Margins, Object Of Interest, Object Pose, Accurate Pose, Objective Description, Camera Intrinsic Parameters, Pose Estimation Methods, Text Encoder, Camera Pose, Object Naming, High-resolution Feature Maps, Object 6D pose estimation, open-vocabulary
Abstract: Generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming the previous best-performing approach by 12.6 points in Average Recall.
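Once cross-scene matches are extracted, the registration step reduces to classical rigid alignment; below is a minimal sketch using the Kabsch (SVD) algorithm on matched 3D points (the matching itself is elided, and this is not Horyon's code):

```python
import torch

def kabsch(src, dst):
    """Rigid alignment of matched point sets src, dst of shape (N, 3) -> R, t."""
    c_src, c_dst = src.mean(0), dst.mean(0)
    H = (src - c_src).T @ (dst - c_dst)           # cross-covariance matrix
    U, _, Vt = torch.linalg.svd(H)
    d = torch.sign(torch.linalg.det(Vt.T @ U.T))  # fix a possible reflection
    R = Vt.T @ torch.diag(torch.tensor([1.0, 1.0, d.item()])) @ U.T
    return R, c_dst - R @ c_src

R, t = kabsch(torch.randn(20, 3), torch.randn(20, 3))
```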
Vicinal Gaussian Transform: Rethinking Source-Free Domain Adaptation Through Source-Informed Label Consistency
Jing Wang, Yongchao Xu, Jing Tang, Zeyu Gong, Bo Tao, Clarence W. de Silva, Xiang Bai
Keywords: Adaptation models, Data models, Transforms, Predictive models, Training, Stochastic processes, Trajectory, Neural networks, Mathematical models, Bridges, Domain Adaptation, Label Consistency, Source-free Domain Adaptation, Data Sources, Denoising, Energy Function, 2D Images, Domain Shift, Stochastic Differential Equations, 3D Point Cloud, Ablation, Transition State, Brownian Motion, Latent Space, Statistical Evidence, Target Features, Target Domain, Target Data, Decision Boundary, Source Domain, Source Class, Query Features, Energy-based Model, Unlabeled Target Data, Negative Gradient, Covariate Shift, Semantic Change, Self-supervised Learning, Synthetic Point, Local Weights, Source-free domain adaptation, energy-based model, latent diffusion model, consistency model
Abstract: A central challenge in source-free domain adaptation (SFDA) is the lack of a theoretical framework for explicitly analyzing domain shifts, as the absence of source data prevents direct domain comparisons. In this paper, we introduce the Vicinal Gaussian Transform (VGT), an analytical operator that models source-informed latent vicinities as Gaussians and shows that vicinal prediction divergence is bounded by their covariance. Under this formulation, SFDA can be reframed as shrinking covariance to reinforce label consistency. To operationalize this idea, we introduce the Energy-based VGT (EBVGT), a novel SDE that realizes the Gaussian transform by contracting covariance through a denoising mechanism. A recovery likelihood with a Schrödinger-Bridge smoothness penalty denoises perturbed states, while a BYOL-derived energy function, obtained directly from model predictions, provides the score to guide label-consistent trajectories within the vicinity. This design not only yields noise-suppressed vicinal features for adaptation without source data, but also eliminates the need for additional learnable parameters for score estimation, in contrast to conventional deep SDEs. Our EBVGT is model- and modality-agnostic, efficient for classification, and improves state-of-the-art SFDA methods by 1.3–3.0% (2.0% on average) across both 2D image and 3D point cloud benchmarks.
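The denoising mechanism can be caricatured by a Langevin-type update along the negative gradient of an energy function; the following is a schematic under our own assumptions (a toy quadratic energy standing in for the BYOL-derived one), not the paper's exact SDE:

```python
import torch

def denoise_step(z, energy_fn, step=0.1, noise_scale=0.01):
    """Move a perturbed latent z down the energy landscape, plus small noise."""
    z = z.detach().requires_grad_(True)
    grad, = torch.autograd.grad(energy_fn(z).sum(), z)
    return (z - step * grad + noise_scale * torch.randn_like(z)).detach()

energy = lambda z: (z ** 2).sum(dim=-1)  # toy quadratic energy
z = torch.randn(4, 8)                    # perturbed latent features
for _ in range(10):
    z = denoise_step(z, energy)          # trajectory contracts toward low energy
```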
Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning
Min Cao, Xinyu Zhou, Ding Jiang, Bo Du, Mang Ye, Min Zhang
Keywords: Multilingual, Translation, Cognition, Benchmark testing, Visualization, Representation learning, Pipelines, Noise measurement, Footwear, Feature extraction, Relational Reasoning, Person Retrieval, Language Model, Textual Descriptions, Global Alignment, Masked Images, Domain-specific Knowledge, Alignment Strategy, Implicit Relations, Visual Information, Feature Representation, Large Model, Textual Information, Image Representation, Student Model, Source Text, Text Representation, Global Representation, Fine-grained Information, Retrieval Performance, Target Text, Pretext Task, Masked Language Model, Coarse-grained Level, Image Encoder, Fine-grained Level, Text Query, Discriminative Feature Representation, Text Encoder, Person Image, Text-to-image person retrieval, multilingual image-text learning, person re-identification
Abstract: Text-to-image person retrieval (TIPR) aims to identify a target person using textual descriptions, and faces the challenge of modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA, a Bidirectional Implicit Relation Reasoning and Aligning framework, to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities; a multi-dimensional global alignment module is further integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets.
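A minimal sketch of the bidirectional masked-prediction idea (layer sizes and names are our assumptions, not the Bi-IRRA code): masked text tokens are predicted from image tokens and masked image tokens from text tokens, which is where the cross-modal local relations are learned.

```python
import torch
import torch.nn as nn

class BiMaskedHead(nn.Module):
    def __init__(self, dim, vocab_size, patch_vocab_size, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_text = nn.Linear(dim, vocab_size)         # masked-word logits
        self.to_image = nn.Linear(dim, patch_vocab_size)  # masked-patch logits

    def forward(self, text_tok, img_tok):
        t, _ = self.cross(text_tok, img_tok, img_tok)   # text attends to image
        v, _ = self.cross(img_tok, text_tok, text_tok)  # image attends to text
        return self.to_text(t), self.to_image(v)

head = BiMaskedHead(dim=64, vocab_size=1000, patch_vocab_size=256)
logits_t, logits_v = head(torch.randn(2, 16, 64), torch.randn(2, 49, 64))
```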
SMC++: Masked Learning of Unsupervised Video Semantic Compression
Yuan Tian, Xiaoyue Ling, Cong Geng, Qiang Hu, Guo Lu, Guangtao Zhai
Keywords: Semantics, Videos, Encoding, Image coding, Transformers, Artificial intelligence, Visualization, Analytical models, Adaptation models, Video compression, Blueprint, Heterogeneous Characteristics, Video Analysis, Semantic Representations, Human Vision, Compression Method, Motion Prediction, Video Compression, Transformer, Decoding, Mutual Information, Semantic Information, Learning Objectives, Visual Quality, Semantic Features, Action Recognition, Optical Flow, Segmentation Task, Semantic Task, Previous Frame, Multiple Object Tracking, Top-1 Accuracy, Original Video, Action Recognition Task, Feature Alignment, Encoder Network, Least Significant Bit, Flow Map, Learning Procedure, Motion Information, Video compression, masked image/video modeling, video action recognition
Abstract: Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics by jointly mining and compressing the semantics in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information such as trivial textural details, wasting bit cost and introducing semantic noise. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC to an advanced SMC++ model in several respects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning. Second, we introduce a Transformer-based compression module to improve semantic compression efficacy. Considering that directly mining the complex redundancy among heterogeneous features in different coding stages is non-trivial, we introduce a compact blueprint semantic representation to align these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate that the proposed SMC and SMC++ models show remarkable superiority over previous traditional, learnable, and perceptual quality-oriented video codecs on three video analysis tasks and seven datasets.
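Our reading of the joint objective, as a hedged sketch (the loss form, weighting, and token space are assumptions, not the released code): a masked-prediction loss mines semantics, while an entropy-style penalty on the compressed token distribution discourages spending bits on non-semantic detail.

```python
import torch
import torch.nn.functional as F

def smc_style_loss(pred_tokens, target_tokens, token_logits, lam=0.1):
    recon = F.mse_loss(pred_tokens, target_tokens)  # masked patch prediction
    probs = token_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    return recon + lam * entropy  # penalize non-semantic entropy in token space

loss = smc_style_loss(torch.randn(2, 16, 32), torch.randn(2, 16, 32),
                      torch.randn(2, 16, 512))
```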
Affine Correspondences Between Multi-Camera Systems for Relative Pose Estimation
Banglei Guan, Ji Zhao
Keywords: Pose estimation, Cameras, Accuracy, Robot vision systems, Translation, Uncertainty, Three-dimensional displays, Simultaneous localization and mapping, Robots, Optimization methods, Pose Estimation, Relative Pose, Multi-camera System, Relative Pose Estimation, Specific Parameters, Vertical Direction, Related Problems, Rotation Angle, Inertial Measurement Unit, Geometric Constraints, Relative Angle, Random Sample Consensus, Minimal Solution, First Approximation, Image Pairs, Autonomous Vehicles, Affine Transformation, Numerical Stability, Pitch Angle, Roll Angle, Extra Constraints, Rotation Error, Performance Of Solver, Perspective Camera, Metric Scale, KITTI Dataset, Depth Parameter, Simultaneous Localization And Mapping, Inliers, Rotation Parameters, Relative pose estimation, multi-camera system, affine correspondence, minimal solver, rotation angle
Abstract: We present a novel method to compute the relative pose of multi-camera systems using two affine correspondences (ACs). Existing solutions to multi-camera relative pose estimation are either restricted to special cases of motion, have too high computational complexity, or require too many point correspondences (PCs). Thus, these solvers impede efficient or accurate relative pose estimation when applying RANSAC as a robust estimator. This paper shows that the 6DOF relative pose estimation problem using ACs permits a feasible minimal solution when the geometric constraints between ACs and multi-camera systems are exploited using a special parameterization. We present a problem formulation based on two ACs that encompasses the two common types of ACs across two views, i.e., inter-camera and intra-camera. Moreover, we exploit a unified and versatile framework for generating 6DOF solvers. Building upon this foundation, we use the framework to address two categories of practical scenarios. First, for the more challenging 7DOF relative pose estimation problem, where the scale transformation of multi-camera systems is unknown, we propose 7DOF solvers that compute the relative pose and scale using three ACs. Second, leveraging inertial measurement units (IMUs), we introduce several minimal solvers for constrained relative pose estimation problems, including 5DOF solvers with known relative rotation angle and a 4DOF solver with known vertical direction. Experiments on both virtual and real multi-camera systems prove that the proposed solvers are more efficient than state-of-the-art algorithms while yielding better relative pose accuracy.
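The efficiency argument hinges on RANSAC's sample size: with a 2-AC minimal solver, far fewer iterations are needed for a given confidence than with point-based solvers that sample more correspondences. A generic skeleton follows (illustrative only; the minimal solver itself is the paper's contribution and is left abstract, demonstrated here with a toy 2-point line fit):

```python
import math
import random

def ransac(data, minimal_solver, residual, sample_size=2,
           thresh=1e-2, confidence=0.999, max_iters=10000):
    best_model, best_inliers = None, []
    iters, i = max_iters, 0
    while i < iters:
        model = minimal_solver(random.sample(data, sample_size))
        inliers = [d for d in data if residual(model, d) < thresh]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
            w = len(inliers) / len(data)       # inlier ratio
            if 0 < w < 1:                      # adaptive stopping criterion:
                iters = min(max_iters, math.ceil(  # smaller samples -> fewer iters
                    math.log(1 - confidence) / math.log(1 - w ** sample_size)))
        i += 1
    return best_model, best_inliers

# Tiny demo: robust line fit y = a*x + b from 2-point minimal samples.
pts = [(float(x), 2.0 * x + 1.0) for x in range(20)] + [(5.5, 40.0), (9.5, -30.0)]
def solve(s):
    (x1, y1), (x2, y2) = s
    a = (y2 - y1) / (x2 - x1)
    return a, y1 - a * x1
model, inliers = ransac(pts, solve, lambda m, p: abs(m[0] * p[0] + m[1] - p[1]))
```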
Causal Interventional Prompt Tuning for Few-Shot Out-of-Distribution Generalization
Jie Wen, Yicheng Liu, Chao Huang, Chengliang Liu, Yong Xu, Xiaochun Cao
Keywords: Training, Adaptation models, Correlation, Feature extraction, Tuning, Birds, Visualization, Robustness, Image recognition, Data models, Semantic, Input Image, Generalization Ability, Causal Inference, Joint Effect, Fine-tuned Model, Training Distribution, Causal Effect, Image Features, Class Labels, Average Performance, ImageNet, Domain Shift, Latent Space, Base Classes, Domain Generalization, Attention Map, Named Entity Recognition, True Cause, Unobserved Confounders, Text Encoder, Image Encoder, True Causal Effect, Number Of Templates, Variety Of Texts, Fine-tuning Method, Learning Vector, Dataset ID, CLIP, Few-shot classification, Causal inference, Out-of-Distribution generalization
Abstract: Fine-tuning pre-trained vision-language models (VLMs) has shown substantial benefits in a wide range of downstream tasks, often achieving impressive performance with minimal labeled data. Parameter-efficient fine-tuning techniques, in particular, have demonstrated their effectiveness in enhancing downstream task performance. However, these methods frequently struggle to generalize to out-of-distribution (OOD) data due to their reliance on non-causal representations, which can introduce biases and spurious correlations that negatively impact decision-making. Such spurious factors hinder the model's ability to generalize beyond the training distribution. To address these challenges, we propose a novel causal intervention-based prompt tuning method that adapts VLMs to few-shot OOD generalization. Specifically, we leverage the front-door adjustment technique from causal inference to mitigate the effects of spurious correlations and enhance the model's focus on causal relationships. Built upon VLMs, our approach begins by decoupling causal and non-causal representations in the vision-language alignment process. The causal representation, which captures only essential, semantically relevant information, serves as a mediator variable between the input image and the output label, mitigating biases from the latent confounder. To further enrich this causal representation, we propose a novel text-based diversity augmentation technique that uses textual features to provide additional semantic context. This augmentation enhances the diversity of the causal representation, making it more robust and generalizable to various OOD scenarios. Experimental results across multiple OOD datasets demonstrate that our method significantly outperforms existing approaches, achieving state-of-the-art generalization performance.
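The front-door adjustment the method builds on can be computed exactly in a toy discrete model (this illustrates the standard causal tool, not the paper's network): with mediator M between X and Y, P(y | do(x)) = sum over m of P(m | x) times the sum over x' of P(y | m, x') P(x').

```python
def front_door(p_m_given_x, p_y_given_mx, p_x, x, y, X, M):
    """P(y | do(x)) via the front-door formula on discrete probability tables."""
    return sum(
        p_m_given_x[(m, x)]
        * sum(p_y_given_mx[(y, m, xp)] * p_x[xp] for xp in X)
        for m in M
    )

# Toy binary example: M closely follows X, and Y closely follows M.
X, M, Y = [0, 1], [0, 1], [0, 1]
p_x = {0: 0.5, 1: 0.5}
p_m_given_x = {(m, x): 0.9 if m == x else 0.1 for m in M for x in X}
p_y_given_mx = {(y, m, xp): 0.8 if y == m else 0.2
                for y in Y for m in M for xp in X}
print(front_door(p_m_given_x, p_y_given_mx, p_x, x=1, y=1, X=X, M=M))  # 0.74
```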
Optimizing Unnormalized Statistical Models Through Compositional Optimization
Wei Jiang, Jiayu Qin, Lingyu Wu, Changyou Chen, Tianbao Yang, Lijun Zhang
Keywords: Noise, Training, Optimization, Stochastic processes, Convergence, Complexity theory, Maximum likelihood estimation, Computational modeling, Probability density function, Monte Carlo methods, Optimization Of Composition, Convergence Rate, Density Estimation, Stochastic Optimization, Partition Function, Slow Convergence, Stochastic Algorithm, Noise Distribution, Gaussian Approximation, Artificial Noise, Flat Landscape, Energy-based Model, Logistic Loss, Training Set, Data Distribution, Objective Function, Monte Carlo Simulation, Learning Rate, Hyperparameters, Maximum Likelihood Estimation, Markov Chain Monte Carlo, Convex Function, Maximum Mean Discrepancy, Lipschitz Continuous, Number Of Steps, Fréchet Inception Distance, Restricted Boltzmann Machine, Momentum Parameter, Error Correction Term, Langevin Dynamics, Noise-contrastive estimation, stochastic compositional optimization, unnormalized statistical models, parameter-free algorithms, max likelihood estimation
Abstract: Learning unnormalized statistical models (e.g., energy-based models) is computationally challenging due to the complexity of handling the partition function. To eschew this complexity, noise-contrastive estimation (NCE) has been proposed, formulating the objective as the logistic loss between the real data and artificial noise. However, previous research indicates that NCE may perform poorly in many tasks due to its flat loss landscape and slow convergence. In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models through the lens of compositional optimization. To tackle the partition function, a noise distribution is introduced such that the log partition function can be expressed as a compositional function whose inner function can be estimated with stochastic samples. Consequently, the objective can be optimized via stochastic compositional optimization algorithms. Despite being a simple method, we demonstrate that it is more favorable than NCE by (1) establishing a fast convergence rate and quantifying its dependence on the noise distribution through the variance of stochastic estimators; (2) developing better results for Gaussian mean estimation, showing that our method has a much more favorable loss landscape and enjoys faster convergence; and (3) demonstrating better performance on various applications, including density estimation, out-of-distribution detection, and real image generation.
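The compositional trick can be sketched as follows (our paraphrase with toy names, not the paper's algorithm verbatim): the log-partition log Z = log E_{x~q}[exp(-E(x))/q(x)] has an inner expectation that a running average u can track, yielding a stochastic surrogate for the negative log-likelihood without NCE's logistic loss.

```python
import math
import torch

def nll_step(energy, data, noise_sampler, noise_logprob, u, beta=0.9):
    """One stochastic compositional step for an unnormalized model."""
    e_data = energy(data).mean()                   # data term of the NLL
    x = noise_sampler(data.size(0))                # samples from the noise dist q
    w = torch.exp(-energy(x) - noise_logprob(x))   # exp(-E)/q, unnormalized
    u = beta * u + (1 - beta) * w.mean().detach()  # track the inner expectation
    loss = e_data + w.mean() / u  # gradient of log(.) evaluated at the estimate u
    return loss, u

# Toy usage: quadratic "energy" with a standard-normal noise distribution q.
energy = lambda z: 0.5 * (z ** 2).sum(dim=-1)
q_sample = lambda n: torch.randn(n, 2)
q_logprob = lambda z: -0.5 * (z ** 2).sum(-1) - math.log(2 * math.pi)
loss, u = nll_step(energy, torch.randn(8, 2), q_sample, q_logprob, torch.tensor(1.0))
```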