IEEE Transactions on Pattern Analysis and Machine Intelligence

Archived papers: 742
Towards Understanding Convergence and Generalization of AdamW
Pan Zhou, Xingyu Xie, Zhouchen Lin, Shuicheng Yan
Keywords: Convergence; Complexity theory; Optimization; Adaptive algorithms; Training; Bayes methods; Upper bound; Adaptive Algorithm; Weight Decay; Stochastic Gradient; Network Weights; Optimization Step; Non-convex Problem; Generalization Error; Second-order Moments; Learning Rate; Deep Network; Convergence Rate; Local Minima; Singular Value; ImageNet; Generalization Performance; AlexNet; Result Of Theorem; Stochastic Differential Equations; Error Bounds; Least Squares Problem; Learning Rate Decay; Constant Learning Rate; Vision Transformer; Faster Convergence Speed; Learning Rate Set; Fully-connected Network; Accurate Point; Adaptive gradient algorithms; analysis of AdamW; convergence of AdamW; generalization of AdamW
Abstract: AdamW modifies Adam by adding a decoupled weight decay that decays network weights at each training iteration. For adaptive algorithms, this decoupled weight decay does not affect specific optimization steps, and it differs from the widely used $\ell_{2}$-regularizer, which changes optimization steps via the first- and second-order gradient moments. Despite its great practical success, the convergence behavior of AdamW and its generalization improvement over Adam and $\ell_{2}$-regularized Adam ($\ell_{2}$-Adam) remain unclear. To address this issue, we prove the convergence of AdamW and justify its generalization advantages over Adam and $\ell_{2}$-Adam. Specifically, AdamW provably converges but minimizes a dynamically regularized loss that combines the vanilla loss and a dynamical regularization induced by the decoupled weight decay, thus yielding behaviors different from those of Adam and $\ell_{2}$-Adam. Moreover, on both general nonconvex problems and PŁ-conditioned problems, we establish the stochastic gradient complexity of AdamW to find a stationary point. This complexity also applies to Adam and $\ell_{2}$-Adam and improves their previously known complexity, especially for over-parametrized networks. In addition, we prove that AdamW enjoys smaller generalization errors than Adam and $\ell_{2}$-Adam from the Bayesian posterior aspect. This result, for the first time, explicitly reveals the benefits of decoupled weight decay in AdamW. Experimental results validate our theory.
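The difference between decoupled weight decay and $\ell_{2}$-regularization can be made concrete through the update rules. The following is a minimal NumPy sketch contrasting a single AdamW step with a single $\ell_{2}$-Adam step; the hyperparameter names and defaults are illustrative choices, not the paper's notation.

```python
import numpy as np

def adam_like_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, wd=1e-2, decoupled=True):
    """One optimization step (t >= 1). decoupled=True -> AdamW; False -> l2-Adam."""
    if not decoupled:
        # l2-regularization enters the gradient, so it also changes the moments.
        grad = grad + wd * w
    m = beta1 * m + (1 - beta1) * grad          # first-order moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-order moment
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: weight decay is applied directly to the weights,
        # leaving the adaptive moments untouched.
        w = w - lr * wd * w
    return w, m, v
```

The only difference is where the decay term enters: in $\ell_{2}$-Adam it passes through the adaptive moments and is rescaled by the preconditioner, whereas in AdamW it acts on the weights directly, which is the decoupling the abstract analyzes.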
EGCN++: A New Fusion Strategy for Ensemble Learning in Skeleton-Based Rehabilitation Exercise Assessment
Bruce X. B. Yu, Yan Liu, Keith C. C. Chan, Chang Wen Chen
Keywords: Skeleton; Hidden Markov models; Data models; Ensemble learning; Convolutional neural networks; Training; Three-dimensional displays; Fusion Strategy; Extensive Experiments; Rating Scores; Exercise Performance; Graph Convolutional Network; Position Features; Preferred Characteristics; Graph Convolution; Decision Level; Ensemble Strategy; Skeleton Data; Alzheimer’s Disease; Convolutional Neural Network; Convolutional Layers; Predictive Ability; Spatial Dimensions; Convolution Operation; Ensemble Method; Action Recognition; Data Fusion; Graph Convolutional Network Model; Sub-models; Inertial Measurement Unit; Runtime Analysis; Predictive Ability Of The Model; Vertices; Global Average Pooling Layer; Action Recognition Task; Kinect V2; Numerical Score; Human action evaluation; model-based fusion; ensemble learning; Humans; Neural Networks, Computer; Algorithms; Image Processing, Computer-Assisted; Machine Learning; Databases, Factual
Abstract: Skeleton-based exercise assessment focuses on evaluating the correctness or quality of an exercise performed by a subject. Skeleton data provide two groups of features (i.e., position and orientation), which existing methods have not fully harnessed. We previously proposed an ensemble-based graph convolutional network (EGCN) that considers both position and orientation features to construct a model-based approach. Integrating these two types of features achieved better performance than available methods. However, EGCN lacked a fusion strategy across the data, feature, decision, and model levels. In this paper, we present an advanced framework, EGCN++, for rehabilitation exercise assessment. Building on EGCN, we propose a new fusion strategy for EGCN++, called MLE-PO, which considers fusion at both the data and model levels. We conduct extensive cross-validation experiments and investigate the consistency between machine and human evaluations on three datasets: UI-PRMD, KIMORE, and EHE. Results demonstrate that MLE-PO outperforms other EGCN ensemble strategies and representative baselines. Furthermore, MLE-PO's model evaluation scores are more quantitatively consistent with clinical evaluations than those of other ensemble strategies.
Create Your World: Lifelong Text-to-Image Diffusion
Gan Sun, Wenqi Liang, Jiahua Dong, Jun Li, Zhengming Ding, Yang Cong
Keywords: Task analysis; Dogs; Computational modeling; Semantics; Training; Neural networks; Electronic mail; Diffusion Model; Image Generation; Memory Enhancement; Enhancement Module; Catastrophic Forgetting; Prior Conceptions; Neural Network; Computational Cost; Short-term Memory; Previous Knowledge; Image Pairs; Early Formation; Results In Fig; Lifelong Learning; Concept Of Learning; Incremental Learning; Specific Concepts; Large-scale Models; Computer Vision Applications; Multiple Concepts; Memory Bank; Latent Code; Diffusion Problem; Target Concept; Distribution Of Attention; Model In Section; Singular Value Decomposition; Learning Rate; Latent Space; Continual learning; image generation; lifelong machine learning; stable diffusion
Abstract: Text-to-image generative models can produce diverse high-quality images of concepts from a text prompt and have demonstrated excellent ability in image generation, image translation, etc. In this work, we study the problem of synthesizing instantiations of a user's own concepts in a never-ending manner, i.e., create your world, where new concepts from the user are quickly learned from a few examples. To achieve this goal, we propose a Lifelong text-to-image Diffusion Model (L$^{2}$DM), which is intended to overcome knowledge “catastrophic forgetting” of previously encountered concepts and semantic “catastrophic neglecting” of one or more concepts in the text prompt. With respect to knowledge “catastrophic forgetting”, our L$^{2}$DM framework devises a task-aware memory enhancement module and an elastic-concept distillation module, which respectively safeguard the knowledge of prior concepts and of each past personalized concept. When generating images from a user text prompt, semantic “catastrophic neglecting” is addressed by a concept attention artist module, which alleviates neglect at the concept level, and an orthogonal attention module, which reduces semantic binding at the attribute level. As a result, our model generates more faithful images across a range of continual text prompts, in terms of both qualitative and quantitative metrics, when compared with related state-of-the-art models.
Gaussian Process-Gated Hierarchical Mixtures of Experts
Yuhao Liu, Marzieh Ajirak, Petar M. Djurić
Keywords: Kernel; Decision trees; Bayes methods; Global Positioning System; Regression tree analysis; Artificial neural networks; Vegetation; Mixture Of Experts; Benchmark; Neural Network; Deep Neural Network; Gaussian Process; Input Space; Random Feature; Variational Inference; Bayesian Neural Network; Gate Model; Objective Function; Decision Tree; Classification Task; Binary Classification; Hidden Layer; Feature Space; Training Time; Welch’s T-test; Kullback-Leibler; Radial Basis Function; Tree Height; Radial Basis Function Kernel; Inner Nodes; Leaf Node; Input Dimension; Covariance Function; Tree Structure; Hidden Variables; Kernel Type; Classification Datasets; Gaussian processes; mixtures of experts; soft decision trees; random features
Abstract: In this article, we propose novel Gaussian process-gated hierarchical mixtures of experts (GPHMEs). Unlike other mixtures of experts whose gating models are linear in the input, our model employs gating functions built with Gaussian processes (GPs). These processes are based on random features that are nonlinear functions of the inputs. Furthermore, the experts in our model are also constructed with GPs. The optimization of the GPHMEs is performed by variational inference. The proposed GPHMEs have several advantages. They outperform tree-based HME benchmarks that partition the data in the input space, and they achieve good performance with reduced complexity. Another advantage is the interpretability they provide for deep GPs and, more generally, for deep Bayesian neural networks. Our GPHMEs demonstrate excellent performance on large-scale data sets, even with quite modest model sizes.
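As a rough illustration of the gating ingredient described above (not the authors' implementation), the sketch below lifts inputs through random Fourier features that approximate an RBF kernel and applies a softmax gate that is linear in those features; the input dimension, feature count, and lengthscale are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random projection defining the features (placeholder sizes).
d, num_features, lengthscale = 4, 64, 1.0
W = rng.normal(scale=1.0 / lengthscale, size=(d, num_features))
b = rng.uniform(0.0, 2 * np.pi, size=num_features)

def random_fourier_features(X):
    """Random features approximating an RBF (Gaussian) kernel."""
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def softmax_gate(X, gate_weights):
    """Soft gate over the children of an inner node, linear in the random
    features (hence nonlinear in the raw inputs)."""
    logits = random_fourier_features(X) @ gate_weights   # (n, n_children)
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```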
HIRI-ViT: Scaling Vision Transformer With High Resolution Inputs
Ting Yao, Yehao Li, Yingwei Pan, Tao Mei
Keywords: Transformers; Convolution; Convolutional neural networks; Computational efficiency; Spatial resolution; Visualization; Task analysis; High-resolution; High Input; Input Resolution; Vision Transformer; High-resolution Input; Computational Cost; Convolutional Neural Network; Convolution Operation; Vision Tasks; Parallel Branches; Top-1 Accuracy; Heavy Computational Cost; Feature Maps; Object Detection; Semantic Segmentation; Field Of Computer Vision; Teacher Network; Instance Segmentation; V100 GPU; Student Network; Transformer Block; Convolutional Neural Networks Backbone; COCO Dataset; Low-resolution Input; Strided Convolution; Global Interaction; Mask R-CNN; Inductive Bias; Depthwise Convolution; Multi-head Self-attention; Image recognition; self-attention learning; vision transformer
Abstract: Hybrid deep models combining Vision Transformer (ViT) and Convolutional Neural Network (CNN) components have emerged as a powerful class of backbones for vision tasks. Scaling up the input resolution of such hybrid backbones naturally strengthens model capacity, but inevitably incurs a heavy computational cost that scales quadratically. Instead, we present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT) that upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs. HIRI-ViT is built upon the seminal idea of decomposing the typical CNN operations into two parallel CNN branches in a cost-efficient manner. One high-resolution branch directly takes the primary high-resolution features as inputs, but uses fewer convolution operations. The other low-resolution branch first performs down-sampling and then applies more convolution operations over the low-resolution features. Experiments on both a recognition task (ImageNet-1K dataset) and dense prediction tasks (COCO and ADE20K datasets) demonstrate the superiority of HIRI-ViT. More remarkably, under comparable computational cost ($\sim$5.0 GFLOPs), HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448×448 inputs, an absolute improvement of 0.9% over the 83.4% of iFormer-S with 224×224 inputs.
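The two-branch decomposition described above can be sketched roughly in PyTorch as follows; the channel counts, kernel sizes, normalization, and fusion by addition are assumptions for illustration, not the exact HIRI-ViT block design.

```python
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    """Illustrative two-branch stage: a cheap path applied to the
    high-resolution input and a heavier path applied to a downsampled copy,
    fused at the lower resolution (even input sizes assumed)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # High-resolution branch: few convolutions, stride 2.
        self.high_res = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
        )
        # Low-resolution branch: downsample first, then more convolutions.
        self.low_res = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
        )

    def forward(self, x):
        return self.high_res(x) + self.low_res(x)   # fuse the two paths
```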
EvHandPose: Event-Based 3D Hand Pose Estimation With Sparse Supervision
Jianping Jiang, Jiahe Li, Baowen Zhang, Xiaoming Deng, Boxin Shi
Keywords: Pose estimation; Cameras; Three-dimensional displays; Annotations; Streaming media; Optical flow; Shape; Pose Estimation; Hand Pose; Hand Pose Estimation; 3D Hand Pose Estimation; Event-based 3D; Large-scale Datasets; Real-world Datasets; Strong Light; High Dynamic Range; Motion Information; Fast Motion; Representation Of Events; Hand Gestures; Hand Motion; Human Pose Estimation; Outdoor Scenes; Domain Gap; 3D Pose; Camera Type; Event Stream; RGB Camera; Optical Flow; Hand Shape; Pose Parameters; Event Frames; 3D Joint; Motor Representations; Edge Features; Challenging Scenarios; Hand Movements; 3D hand pose estimation; event camera
Abstract: Event cameras show great potential in 3D hand pose estimation, especially for addressing the challenges of fast motion and high dynamic range in a low-power way. However, due to the asynchronous differential imaging mechanism, it is challenging to design an event representation that encodes hand motion information, especially when the hands are not moving (causing motion ambiguity), and it is infeasible to fully annotate the temporally dense event stream. In this paper, we propose EvHandPose, with novel hand flow representations in an Event-to-Pose module, for accurate hand pose estimation that alleviates the motion ambiguity issue. To solve the problem of sparse annotation, we design contrast maximization and hand-edge constraints in a Pose-to-IWE (Image with Warped Events) module and formulate EvHandPose in a weakly-supervised framework. We further build EvRealHands, the first large-scale real-world event-based hand pose dataset covering several challenging scenes, to bridge the real-synthetic domain gap. Experiments on EvRealHands demonstrate that EvHandPose outperforms previous event-based methods in all evaluation scenes, achieves accurate and stable hand pose estimation with high temporal resolution in fast-motion and strong-light scenes compared with RGB-based methods, generalizes well to outdoor scenes and another type of event camera, and shows potential for the hand gesture recognition task.
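Contrast maximization, one of the constraints mentioned above, is a general event-camera technique: events are warped to a reference time along a candidate motion, and a sharper (higher-variance) image of warped events indicates better motion compensation. A minimal NumPy sketch of that objective is given below; the constant-flow assumption and per-pixel accumulation are simplifications, not the paper's formulation.

```python
import numpy as np

def image_of_warped_events(xs, ys, ts, flow, t_ref, shape):
    """Warp each event (x, y, t) to a reference time along a candidate flow
    (pixels per unit time) and accumulate it into an image."""
    wx = np.clip(np.round(xs - flow[0] * (ts - t_ref)), 0, shape[1] - 1).astype(int)
    wy = np.clip(np.round(ys - flow[1] * (ts - t_ref)), 0, shape[0] - 1).astype(int)
    iwe = np.zeros(shape)
    np.add.at(iwe, (wy, wx), 1.0)   # event count per pixel
    return iwe

def contrast(iwe):
    """Contrast-maximization score: variance of the image of warped events."""
    return iwe.var()
```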
FeatAug-DETR: Enriching One-to-Many Matching for DETRs With Feature Augmentation
Rongyao Fang, Peng Gao, Aojun Zhou, Yingjie Cai, Si Liu, Jifeng Dai, Hongsheng Li
Keywords: Detectors; Training; Transformers; Feature extraction; Decoding; Convergence; Pose estimation; Feature Augmentation; DEtection TRansformer; Image Features; Convergence Rate; Detection Performance; Object Detection; Data Augmentation; Version Of Image; Training Batch; Image Augmentation; Slow Convergence Speed; Feature Maps; Bounding Box; Original Features; Training Epochs; Multi-scale Features; Pose Estimation; Random Flipping; Non-maximum Suppression; Predicted Bounding Box; Transformer Decoder; ResNet-50 Backbone; Transformer Encoder; Matching Loss; Ground-truth Box; Ground Truth Object; 3D Detection; 3D Object Detection; Bipartite Matching; Short Epochs; Detection transformer; one-to-one matching; one-to-many matching; data augmentation; accelerating training
Abstract: One-to-one matching is a crucial design in DETR-like object detection frameworks. It enables DETR to perform end-to-end detection. However, it also faces the challenges of insufficient positive-sample supervision and slow convergence. Several recent works proposed one-to-many matching mechanisms to accelerate training and boost detection performance. We revisit these methods and model them in a unified format of augmenting the object queries. In this paper, we propose two methods that realize one-to-many matching from a different perspective: augmenting images or image features. The first method is One-to-many Matching via Data Augmentation (denoted as DataAug-DETR). It spatially transforms the images and includes multiple augmented versions of each image in the same training batch. Such a simple augmentation strategy already achieves one-to-many matching and surprisingly improves DETR's performance. The second method is One-to-many Matching via Feature Augmentation (denoted as FeatAug-DETR). Unlike DataAug-DETR, it augments the image features instead of the original images and includes multiple augmented features in the same batch to realize one-to-many matching. FeatAug-DETR significantly accelerates DETR training and boosts detection performance while keeping the inference speed unchanged. We conduct extensive experiments to evaluate the effectiveness of the proposed approach on DETR variants, including DAB-DETR, Deformable-DETR, and $\mathcal{H}$-Deformable-DETR. Without extra training data, FeatAug-DETR shortens the training convergence period of Deformable-DETR (Zhu et al. 2020) to 24 epochs and achieves 58.3 AP on the COCO val2017 set with Swin-L as the backbone.
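The first method (DataAug-DETR) amounts to placing spatially transformed copies of each image in the same batch so that every ground-truth object is matched more than once per iteration. A minimal sketch using horizontal flipping as the transform is shown below; the function name and the box format are illustrative assumptions.

```python
import torch

def expand_batch_with_flips(images, targets):
    """images: list of (C, H, W) tensors; targets: list of dicts whose 'boxes'
    are (x1, y1, x2, y2).  Append a horizontally flipped copy of every sample
    so each ground-truth object receives more than one positive match."""
    flipped_images, flipped_targets = [], []
    for img, tgt in zip(images, targets):
        _, _, w = img.shape
        flipped_images.append(torch.flip(img, dims=[-1]))
        boxes = tgt["boxes"].clone()
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]   # mirror the x-coordinates
        flipped_targets.append({**tgt, "boxes": boxes})
    return images + flipped_images, targets + flipped_targets
```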
Quadratic Matrix Factorization With Applications to Manifold Learning
Zheng Zhai, Hengchao Chen, Qiang Sun
Keywords: Manifold learning; Manifolds; Minimization; Convergence; Fitting; Approximation algorithms; Task analysis; Matrix Factorization; Regularization Parameter; Geometric Structure; Real-world Datasets; Performance Of Method; Linear Function; Computational Complexity; Denoising; Gradient Descent; Quadratic Function; Latent Space; Noisy Data; Proof Of Proposition; Subset Of Dataset; Non-negative Matrix Factorization; Fitting Error; Nearest Point; Variational Autoencoder; Quadratic Problem; Strongly Convex; Tuning Method; Performance Of Different Algorithms; MNIST Dataset; Projection Matrix; Image Dataset; Generative Adversarial Networks; Euclidean Space; Unit Sphere; Affine Space; Alternating minimization; convergence property; manifold learning; quadratic matrix factorization
Abstract: Matrix factorization is a popular framework for modeling low-rank data matrices. Motivated by manifold learning problems, this paper proposes a quadratic matrix factorization (QMF) framework to learn the curved manifold on which a dataset lies. Unlike local linear methods such as local principal component analysis, QMF can better exploit the curved structure of the underlying manifold. Algorithmically, we propose an alternating minimization algorithm to optimize QMF and establish its theoretical convergence properties. To avoid possible over-fitting, we then propose a regularized QMF algorithm and discuss how to tune its regularization parameter. Finally, we elaborate on how to apply the regularized QMF to manifold learning problems. Experiments on a synthetic manifold learning dataset and three real-world datasets, including the MNIST handwritten digit dataset, a cryogenic electron microscopy dataset, and the Frey Face dataset, demonstrate the superiority of the proposed method over its competitors.
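As a toy illustration of the model class described above (not the authors' algorithm), the sketch below fits each data point as a quadratic function of a one-dimensional latent coordinate and alternates between a linear least-squares update of the factor and a gradient update of the coordinates; the latent dimension, initialization, and step size are placeholder choices.

```python
import numpy as np

def qmf_fit(X, n_iters=50, lr=0.01):
    """Toy quadratic factorization with a 1-D latent coordinate: each point
    x_i is modeled as B^T [1, t_i, t_i^2]; B and t are updated alternately."""
    n, _ = X.shape
    t = np.linspace(-1.0, 1.0, n)                       # crude initialization
    for _ in range(n_iters):
        Phi = np.column_stack([np.ones(n), t, t ** 2])  # quadratic design matrix
        B, *_ = np.linalg.lstsq(Phi, X, rcond=None)     # linear in B: least squares
        resid = Phi @ B - X                             # (n, D) fit residuals
        dPhi = np.column_stack([np.zeros(n), np.ones(n), 2 * t])
        grad = 2 * np.sum((dPhi @ B) * resid, axis=1)   # d/dt of squared error
        t -= lr * grad                                  # per-point coordinate update
    return B, t
```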
Improving Fast Adversarial Training With Prior-Guided Knowledge
Xiaojun Jia, Yong Zhang, Xingxing Wei, Baoyuan Wu, Ke Ma, Jue Wang, Xiaochun Cao
Keywords: Training; Robustness; Glass box; Standards; Perturbation methods; Fats; Computational modeling; Adversarial Training; Fast Training; Fast Adversarial Training; Training Time; Decay Rate; Model Weights; Extra Time; Training Epochs; Historical Processes; Standard Training; Regularization Method; Adversarial Examples; Attack Scenarios; Adversarial Perturbations; Attack Success Rate; Adversarial Robustness; Optimization Problem; Average Weight; Learning Rate; Deep Neural Network; Fast Gradient Sign Method; Previous Training; Previous Epoch; Adversarial Attacks; Previous Batch; Attack Methods; Projected Gradient Descent; Training Examples; Decay Factor; Random Initialization; Fast adversarial training; prior-guided; knowledge; training time; model robustness
Abstract: Fast adversarial training (FAT) is an efficient method for improving robustness in white-box attack scenarios. However, the original FAT suffers from catastrophic overfitting, which dramatically and suddenly reduces robustness after a few training epochs. Although various FAT variants have been proposed to prevent this overfitting, they require long training times. In this paper, we investigate the relationship between adversarial example quality and catastrophic overfitting by comparing the training processes of standard adversarial training and FAT. We find that catastrophic overfitting occurs when the attack success rate of adversarial examples deteriorates. Based on this observation, we propose a positive prior-guided adversarial initialization that prevents overfitting by improving adversarial example quality without extra training time. This initialization is generated from high-quality adversarial perturbations collected over the historical training process. We provide a theoretical analysis for the proposed initialization and propose a prior-guided regularization method that boosts the smoothness of the loss function. Additionally, we design a prior-guided ensemble FAT method that averages the model weights of historical models using different decay rates. Our proposed method, called FGSM-PGK, assembles the prior-guided knowledge, i.e., the prior-guided initialization and model weights, acquired during the historical training process. The proposed method can effectively improve the model's adversarial robustness in white-box attack scenarios. Evaluations on four datasets demonstrate the superiority of the proposed method.
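The backbone of any FAT variant is a single-step FGSM attack per mini-batch. The sketch below shows such a step in which the perturbation is seeded from a stored prior (e.g., a perturbation found for the same examples earlier in training) rather than purely at random, reflecting the general flavor of the prior-guided initialization described above; the function name, buffer handling, and hyperparameters are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def fgsm_step_with_prior(model, x, y, prior_delta, eps=8/255, alpha=10/255):
    """One fast-adversarial-training attack step.  `prior_delta` holds a
    previously useful perturbation for these examples; it seeds the
    initialization instead of a purely random start."""
    delta = prior_delta.clone().clamp(-eps, eps).requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    # Single FGSM step, then project into the eps-ball and valid pixel range.
    delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    x_adv = (x + delta).clamp(0.0, 1.0)
    return x_adv, delta   # the returned delta can be stored as the next prior
```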
Neural 3D Scene Reconstruction With Indoor Planar Priors
Xiaowei Zhou, Haoyu Guo, Sida Peng, Yuxi Xiao, Haotong Lin, Qianqian Wang, Guofeng Zhang, Hujun Bao
Keywords: Image reconstruction; Three-dimensional displays; Semantics; Geometry; Semantic segmentation; Rendering (computer graphics); Optimization; 3D Reconstruction; 3D Scene; Scene Reconstruction; 3D Space; Semantic Segmentation; Depth Map; Joint Optimization; Reconstruction Quality; Reconstruction Results; Semantic Network; Wall Region; Planar Regions; Multi-view Images; Scene Geometry; Signed Distance Function; Results In Table; Point Cloud; Neural Representations; Hash Function; Normal Direction; Depth Estimation; Distance Map; View Synthesis; Multi-view Stereo; Scene Representation; Camera Pose; Color Fields; 2D Segmentation; Adaptive Clustering; Semantic Segmentation Results; 3D reconstruction; implicit neural representations; the Manhattan-world assumption; the Atlanta-world assumption
Abstract: This paper addresses the challenge of reconstructing 3D indoor scenes from multi-view images. Many previous works have shown impressive reconstruction results on textured objects, but they still have difficulty handling low-textured planar regions, which are common in indoor scenes. One approach to this issue is to incorporate planar constraints into the depth map estimation of multi-view stereo-based methods, but the per-view plane estimation and depth optimization lack both efficiency and multi-view consistency. In this work, we show that planar constraints can be conveniently integrated into recent implicit neural representation-based reconstruction methods. Specifically, we use an MLP network to represent the signed distance function as the scene geometry. Based on the Manhattan-world and Atlanta-world assumptions, planar constraints are employed to regularize the geometry in floor and wall regions predicted by a 2D semantic segmentation network. To resolve inaccurate segmentation, we encode the semantics of 3D points with another MLP and design a novel loss that jointly optimizes the scene geometry and semantics in 3D space. Experiments on the ScanNet and 7-Scenes datasets show that the proposed method outperforms previous methods by a large margin in 3D reconstruction quality.
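One way to read the planar regularization is as a constraint on normals derived from the signed distance field: floor normals should align with the up direction, and wall normals should be perpendicular to it. A hedged sketch of such a loss is shown below; the region masks, the fixed up axis, and the equal weighting are placeholder choices, and the paper's actual Manhattan/Atlanta-world losses may differ in detail.

```python
import torch

def planar_prior_loss(normals, floor_mask, wall_mask, up=(0.0, 0.0, 1.0)):
    """Regularize SDF normals in planar regions: floor normals should align
    with the up axis, wall normals should be orthogonal to it.
    normals: (N, 3); floor_mask, wall_mask: boolean (N,) tensors."""
    up = torch.tensor(up, device=normals.device)
    normals = torch.nn.functional.normalize(normals, dim=-1)
    cos_up = (normals * up).sum(dim=-1)             # cosine to the up direction
    floor_loss = (1.0 - cos_up[floor_mask]).mean()  # want cos ≈ 1 on floors
    wall_loss = cos_up[wall_mask].abs().mean()      # want cos ≈ 0 on walls
    return floor_loss + wall_loss
```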