IEEE Transactions on Pattern Analysis and Machine Intelligence

Exploring Frequency-Inspired Optimization in Transformer for Efficient Single Image Super-Resolution
Ao Li, Le Zhang, Yun Liu, Ce Zhu
Keywords: Transformers, Quantization (signal), Training, Feature extraction, Computer architecture, Superresolution, Image reconstruction, Data mining, Calibration, Convolutional codes, Super-resolution, Single Image Super-resolution, Quantification Method, Global Information, Residual Block, Long-range Dependencies, Transformer-based Methods, Quantum, Convolutional Neural Network, High-resolution Images, Convolutional Layers, Efficient Solution, Boundary Value, Peak Signal-to-noise Ratio, Transformer Model, Point Spread Function, Residual Connection, High Frequency Components, Quantification Model, Convolutional Neural Network Structure, Quantization Scheme, Magnification Factor, Reconstructed Image Quality, Bicubic Interpolation, Quantization Bits, Reconstruction Module, Local Feature Extraction, Multilayer Perceptron, Reconstruction Quality, Expansion Ratio, Super-resolution, transformer, frequency priors, quantization, low-bit
Abstracts:Transformer-based methods have exhibited remarkable potential in single image super-resolution (SISR) by effectively extracting long-range dependencies. However, most of the current research in this area has prioritized the design of transformer blocks to capture global information, while overlooking the importance of incorporating high-frequency priors, which we believe could be beneficial. In our study, we conducted a series of experiments and found that transformer structures are more adept at capturing low-frequency information, but have limited capacity in constructing high-frequency representations when compared to their convolutional counterparts. Our proposed solution, the cross-refinement adaptive feature modulation transformer (CRAFT), integrates the strengths of both convolutional and transformer structures. It comprises three key components: the high-frequency enhancement residual block (HFERB) for extracting high-frequency information, the shift rectangle window attention block (SRWAB) for capturing global information, and the hybrid fusion block (HFB) for refining the global representation. To tackle the inherent intricacies of transformer structures, we introduce a frequency-guided post-training quantization (PTQ) method aimed at enhancing CRAFT's efficiency. These strategies incorporate adaptive dual clipping and boundary refinement. To further amplify the versatility of our proposed approach, we extend our PTQ strategy to function as a general quantization method for transformer-based SISR techniques. Our experimental findings showcase CRAFT's superiority over current state-of-the-art methods, both in full-precision and quantization scenarios. These results underscore the efficacy and universality of our PTQ strategy.
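
The abstract's motivating observation, that transformer features skew toward low frequencies, can be probed with a simple spectral-energy measurement. The sketch below is our own illustration (not the CRAFT code); it assumes PyTorch feature maps of shape (B, C, H, W), and the cutoff value is an arbitrary placeholder.

```python
# Minimal sketch: estimate how much of a feature map's energy lies in high
# frequencies, the kind of probe that motivates adding convolutional
# high-frequency branches to a transformer SR network.
import torch

def high_freq_energy_ratio(feat: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """feat: (B, C, H, W). Returns the fraction of spectral energy outside a
    centered low-frequency square of half-width `cutoff` * min(H, W)."""
    spec = torch.fft.fftshift(torch.fft.fft2(feat, norm="ortho"), dim=(-2, -1))
    energy = spec.abs() ** 2
    _, _, H, W = feat.shape
    r = int(cutoff * min(H, W))
    cy, cx = H // 2, W // 2
    low = energy[..., cy - r:cy + r, cx - r:cx + r].sum(dim=(-2, -1))
    total = energy.sum(dim=(-2, -1))
    return (1.0 - low / total.clamp_min(1e-12)).mean()

if __name__ == "__main__":
    # Example: compare this ratio for features taken from a conv block and
    # from a transformer block of the same network.
    conv_feat = torch.randn(1, 64, 48, 48)
    print(high_freq_energy_ratio(conv_feat).item())
```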
WAKE: Towards Robust and Physically Feasible Trajectory Prediction for Autonomous Vehicles With WAvelet and KinEmatics Synergy
Chengyue Wang, Haicheng Liao, Zhenning Li, Chengzhong Xu
Keywords: Trajectory, Predictive models, Accuracy, Hidden Markov models, Data models, Vehicle dynamics, Kinematics, Discrete wavelet transforms, Autonomous vehicles, Vehicles, Autonomous Vehicles, Trajectory Prediction, Physical Feasibility, Prediction Model, Machine Learning, Prediction Accuracy, Missing Observations, Interaction Datasets, Interactive, Root Mean Square Error, Model Performance, Incomplete Data, Wavelet Transform, Historical Conditions, Variety Of Scenarios, Model Predictive Control, Motion Patterns, Prediction Horizon, Graph Neural Networks, Vehicle Dynamics, Historical Observations, Target Vehicle, Observational Constraints, Decline In Accuracy, Graph Attention Network, Superposition Of Waves, Steering Angle, State-of-the-art Models, Real-time Performance, Autonomous driving, trajectory prediction, missing observations, physics-informed models
Abstracts:Addressing the pervasive challenge of imperfect data in autonomous vehicle (AV) systems, this study pioneers an integrated trajectory prediction model, WAKE, that fuses physics-informed methodologies with sophisticated machine learning techniques. Our model operates in two principal stages: the initial stage utilizes a Wavelet Reconstruction Network to accurately reconstruct missing observations, thereby preparing a robust dataset for further processing. This is followed by the Kinematic Bicycle Model which ensures that reconstructed trajectory predictions adhere strictly to physical laws governing vehicular motion. The integration of these physics-based insights with a subsequent machine learning stage, featuring a Quantum Mechanics-Inspired Interaction-aware Module, allows for sophisticated modeling of complex vehicle interactions. This fusion approach not only enhances the prediction accuracy but also enriches the model's ability to handle real-world variability and unpredictability. Extensive tests using specific versions of MoCAD, NGSIM, HighD, INTERACTION, and nuScenes datasets featuring missing observational data, have demonstrated the superior performance of our model in terms of both accuracy and physical feasibility, particularly in scenarios with significant data loss—up to 75% missing observations. Our findings underscore the potency of combining physics-informed models with advanced machine learning frameworks to advance autonomous driving technologies, aligning with the interdisciplinary nature of information fusion.
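
The physical-feasibility stage rests on the standard kinematic bicycle model; a minimal sketch of one integration step is given below. The state layout, wheelbase split, and time step are our assumptions for illustration, not values from the paper.

```python
# Minimal sketch of a kinematic bicycle step: advance (x, y, heading, speed)
# given acceleration and steering angle, so predicted motion stays physically
# plausible. lf/lr are distances from the center of gravity to the axles.
import math

def bicycle_step(x, y, yaw, v, accel, steer, lf=1.35, lr=1.35, dt=0.1):
    beta = math.atan(lr / (lf + lr) * math.tan(steer))  # slip angle at the CoG
    x += v * math.cos(yaw + beta) * dt
    y += v * math.sin(yaw + beta) * dt
    yaw += (v / lr) * math.sin(beta) * dt
    v += accel * dt
    return x, y, yaw, v
```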
OmniTracker: Unifying Visual Object Tracking by Tracking-With-Detection
Junke Wang, Zuxuan Wu, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Yu-Gang Jiang
Keywords: Target tracking, Detectors, Object tracking, Training, Feature extraction, Trajectory, Visualization, Pipelines, Correlation, Benchmark testing, Object Tracking, Visual Object Tracking, Types Of Tasks, Unified Model, Bounding Box, Model Weights, Target Object, Video Sequences, Tracking Task, Track Model, Tracking Dataset, Multiple Object Tracking, Training Set, Reference Frame, Object Detection, Pedestrian, Kalman Filter, Temporal Information, Object Motion, Object Segmentation, Object In Frame, Multiple Tracking, Joint Training, Memory Bank, Detection Head, Search Region, Tracking Results, Transformer Encoder, Instance Segmentation, Appearance Information, Unified tracking models, tracking-with-detection, single object tracking, video object segmentation, multiple object tracking, multiple object tracking and segmentation, video instance segmentation
Abstracts:Visual Object Tracking (VOT) aims to estimate the positions of target objects in a video sequence, which is an important vision task with various real-world applications. Depending on whether the initial states of target objects are specified by provided annotations in the first frame or the categories, VOT could be classified as instance tracking (e.g., SOT and VOS) and category tracking (e.g., MOT, MOTS, and VIS) tasks. Different definitions have led to divergent solutions for these two types of tasks, resulting in redundant training expenses and parameter overhead. In this paper, combining the advantages of the best practices developed in both communities, we propose a novel tracking-with-detection paradigm, where tracking supplements appearance priors for detection and detection provides tracking with candidate bounding boxes for the association. Equipped with such a design, a unified tracking model, OmniTracker, is further presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline, eliminating the need for task-specific architectures and reducing redundancy in model parameters. We conduct extensive experimentation on seven prominent tracking datasets of different tracking tasks, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, and demonstrate that OmniTracker achieves on-par or even better results than both task-specific and unified tracking models.
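
The tracking-with-detection loop hinges on associating per-frame candidate boxes with existing tracks; below is a generic, hedged sketch of that association step using IoU and the Hungarian algorithm, not the OmniTracker implementation.

```python
# Minimal sketch: match candidate detection boxes to existing tracks by IoU,
# the "detection provides candidate bounding boxes for the association" half
# of a tracking-with-detection loop.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """tracks, dets: float arrays of (x1, y1, x2, y2) boxes."""
    t, d = tracks[:, None, :], dets[None, :, :]
    ix1 = np.maximum(t[..., 0], d[..., 0]); iy1 = np.maximum(t[..., 1], d[..., 1])
    ix2 = np.minimum(t[..., 2], d[..., 2]); iy2 = np.minimum(t[..., 3], d[..., 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_t = (t[..., 2] - t[..., 0]) * (t[..., 3] - t[..., 1])
    area_d = (d[..., 2] - d[..., 0]) * (d[..., 3] - d[..., 1])
    return inter / np.clip(area_t + area_d - inter, 1e-9, None)

def associate(tracks, dets, iou_thresh=0.3):
    """Return (track_idx, det_idx) pairs whose IoU clears the threshold."""
    iou = iou_matrix(tracks, dets)
    rows, cols = linear_sum_assignment(-iou)  # maximize total IoU
    return [(r, c) for r, c in zip(rows, cols) if iou[r, c] >= iou_thresh]
```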
Exploiting Ground Depth Estimation for Mobile Monocular 3D Object Detection
Yunsong Zhou, Quan Liu, Hongzi Zhu, Yunzhe Li, Shan Chang, Minyi Guo
Keywords: Three-dimensional displays, Cameras, Transformers, Feature extraction, Detectors, Accuracy, Object detection, Depth measurement, Robustness, Transformer cores, Object Detection, Depth Estimation, 3D Object Detection, Monocular 3D Object Detection, Transformer, Image Features, Feature Maps, Depth Images, Feature Fusion, KITTI Dataset, Monocular Camera, Feature Fusion Network, Point Cloud, Ground Plane, Object Position, Pitch Angle, Roll Angle, Object Point, Camera Pose, Accurate Depth, 3D Detection, Camera Coordinate System, Footpoints, 3D Bounding Box, Object Depth, Pinhole Camera Model, Pinhole Camera, Camera Intrinsic Parameters, Real Depth, Monocular 3D object detection, ground depth estimation, vision transformer, autonomous driving
Abstracts:Detecting 3D objects from a monocular camera in mobile applications, such as on a vehicle, drone, or robot, is a crucial but challenging task. The monocular vision’s near-far disparity and the camera’s constantly changing position make it difficult to achieve high accuracy, especially for distant objects. In this paper, we propose a new Mono3D framework named MoGDE, which takes inspiration from the observation that an object’s depth can be inferred from the ground’s depth underneath it. MoGDE estimates the corresponding ground depth of an image and utilizes this information to guide Mono3D. We use a pose detection network to estimate the camera’s orientation and construct a feature map that represents pixel-level ground depth based on the 3D-to-2D perspective geometry. To further improve Mono3D with the estimated ground depth, we design an RGB-D feature fusion network based on transformer architecture. The long-range self-attention mechanism is utilized to identify ground-contacting points and pin the corresponding ground depth to the image feature map. We evaluate MoGDE on the KITTI dataset, and the results show that it significantly improves the accuracy and robustness of Mono3D for both near and far objects. MoGDE outperforms state-of-the-art methods and ranks first among the pure image-based methods on the KITTI 3D benchmark.
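
The geometric prior MoGDE builds on, that the depth of a flat ground patch follows from camera intrinsics, height, and pitch under a pinhole model, can be written down directly. The sketch below is our own illustration; the intrinsics, camera height, and flat-ground assumption are placeholders, not values from the paper.

```python
# Minimal sketch: per-pixel depth of a flat ground plane under a pinhole
# camera, given its height above the ground and its pitch angle. Pixels whose
# ray never hits the ground map to +inf.
import numpy as np

def ground_depth_map(fx, fy, cx, cy, H, W, cam_height=1.65, pitch=0.0):
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Ray direction in camera coordinates (x right, y down, z forward).
    ray = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u, float)], axis=-1)
    # Rotate rays into a gravity-aligned frame using the estimated pitch.
    c, s = np.cos(pitch), np.sin(pitch)
    R = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    ray_w = ray @ R.T
    # Ray-plane intersection: camera sits cam_height above the ground.
    down = np.where(ray_w[..., 1] > 1e-6, ray_w[..., 1], np.nan)
    t = cam_height / down
    depth = t * ray[..., 2]  # z-depth in the camera frame
    return np.where(np.isfinite(depth) & (depth > 0), depth, np.inf)
```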
Condition-Invariant Semantic Segmentation
Christos Sakaridis, David Bruggemann, Fisher Yu, Luc Van Gool
Keywords: Training, Feature extraction, Semantics, Semantic segmentation, Decoding, Benchmark testing, Information technology, Electrical engineering, Computer architecture, Microwave integrated circuits, Semantic Segmentation, Input Image, Internal Quality, Visual Conditions, Domain Adaptation, Self-driving, Semantic Network, Invariant Features, Encoder Network, Semantic Segmentation Network, Training Set, Validation Set, Adverse Conditions, Cross-entropy Loss, Level Characteristics, Target Image, Source Images, Target Domain, Traffic Light, Source Domain, Common Baseline, Unsupervised Domain Adaptation Methods, Mean Intersection Over Union, Night-time Images, Domain Adaptation Methods, Entire Test Set, Basic Setup, Set Of Tables, Target Domain Images, Basic Formulation, Semantic segmentation, domain adaptation, adverse conditions, invariance, unsupervised learning
Abstracts:Adaptation of semantic segmentation networks to different visual conditions is vital for robust perception in autonomous cars and robots. However, previous work has shown that most feature-level adaptation methods, which employ adversarial training and are validated on synthetic-to-real adaptation, provide marginal gains in condition-level adaptation, being outperformed by simple pixel-level adaptation via stylization. Motivated by these findings, we propose to leverage stylization in performing feature-level adaptation by aligning the internal network features extracted by the encoder of the network from the original and the stylized view of each input image with a novel feature invariance loss. In this way, we encourage the encoder to extract features that are already invariant to the style of the input, allowing the decoder to focus on parsing these features and not on further abstracting from the specific style of the input. We implement our method, named Condition-Invariant Semantic Segmentation (CISS), on the current state-of-the-art domain adaptation architecture and achieve outstanding results on condition-level adaptation. In particular, CISS sets the new state of the art in the popular daytime-to-nighttime Cityscapes → Dark Zurich benchmark. Furthermore, our method achieves the second-best performance on the normal-to-adverse Cityscapes → ACDC benchmark. CISS is shown to generalize well to domains unseen during training, such as BDD100K-night and ACDC-night.
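
A hedged sketch of what a feature invariance loss of this kind can look like is given below: the encoder features of an image and its stylized view are aligned while the decoder is trained with the usual cross-entropy. The stylization input, weighting, and ignore index are our assumptions, not the CISS code.

```python
# Minimal sketch of a feature-invariance objective: match encoder features of
# the original and stylized views, plus standard segmentation supervision.
import torch
import torch.nn.functional as F

def invariance_training_loss(encoder, decoder, image, stylized, label, lambda_inv=1.0):
    feat_orig = encoder(image)
    feat_styl = encoder(stylized)
    # Invariance loss: features of the stylized view should match the original.
    loss_inv = F.mse_loss(feat_styl, feat_orig.detach())
    # Standard per-pixel cross-entropy on the original view.
    logits = decoder(feat_orig)
    loss_seg = F.cross_entropy(logits, label, ignore_index=255)
    return loss_seg + lambda_inv * loss_inv
```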
Long-Term Feature Extraction via Frequency Prediction for Efficient Reinforcement Learning
Jie Wang, Mingxuan Ye, Yufei Kuang, Rui Yang, Wengang Zhou, Houqiang Li, Feng Wu
Keywords: Predictive models, Frequency-domain analysis, Feature extraction, Accuracy, Representation learning, Trajectory, Fourier transforms, Visualization, Time-domain analysis, Time series analysis, Efficient Learning, Long-term Characteristics, Time Series, Fourier Transform, Structural Information, Time Domain, Frequency Domain, Future Conditions, Representation Learning, Representation Of Space, Optimal Policy, Sampling Efficiency, Deep Reinforcement Learning, Sequence Of States, Representation Learning Methods, Reinforcement Learning Task, Standard Reinforcement Learning, Time Step, Learning Models, State Space, Auxiliary Task, Reward Signal, Multi-step Prediction, Auxiliary Loss, Markov Decision Process, Reward Function, Raw State, Optimal Value Function, Replay Buffer, Reinforcement Learning Methods, Reinforcement learning, representation learning, state sequences prediction, fourier transform
Abstracts:Sample efficiency remains a key challenge for the deployment of deep reinforcement learning (RL) in real-world scenarios. A common approach is to learn efficient representations through future prediction tasks, facilitating the agent to make farsighted decisions that benefit its long-term performance. Existing methods extract predictive features by predicting multi-step future state signals. However, they do not fully exploit the structural information inherent in sequential state signals, which can potentially improve the quality of long-term decision-making but is difficult to discern in the time domain. To tackle this problem, we introduce a new perspective that leverages the frequency domain of state sequences to extract the underlying patterns in time series data. We theoretically show that state sequences contain structural information closely tied to policy performance and signal regularity and analyze the fitness of the frequency domain for extracting these two types of structural information. Inspired by that, we propose a novel representation learning method, State Sequences Prediction via Fourier Transform (SPF), which extracts long-term features by predicting the Fourier transform of infinite-step future state sequences. The appealing features of our frequency prediction objective include: 1) simple to implement due to a recursive relationship; 2) providing an upper bound on the performance difference between the optimal policy and the latent policy in the representation space. Experiments on standard and goal-conditioned RL tasks demonstrate that the proposed method outperforms several state-of-the-art algorithms in terms of both sample efficiency and performance.
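
The "recursive relationship" mentioned in the abstract is what makes an infinite-step Fourier target trainable. Assuming a discounted transform Phi_w(s_t) = sum_k gamma^k exp(-i w k) s_{t+k}, it satisfies Phi_w(s_t) = s_t + gamma exp(-i w) Phi_w(s_{t+1}), so a bootstrapped (TD-style) regression target can be formed. The sketch below illustrates this under our own naming; it is not the SPF code.

```python
# Minimal sketch: bootstrapped target for a discounted Fourier prediction head,
# split into real and imaginary parts.
import torch

def fourier_td_target(state, next_pred_real, next_pred_imag, omega, gamma=0.99):
    """state: s_t; next_pred_*: predictor outputs at s_{t+1};
    omega: scalar tensor frequency."""
    c, s = torch.cos(omega), torch.sin(omega)
    # gamma * exp(-i*omega) * Phi_w(s_{t+1}), expanded into real/imaginary parts
    tgt_real = state + gamma * (c * next_pred_real + s * next_pred_imag)
    tgt_imag = gamma * (c * next_pred_imag - s * next_pred_real)
    return tgt_real.detach(), tgt_imag.detach()
```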
Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
Xiao Wang, Jianlong Wu, Zijia Lin, Fuzheng Zhang, Di Zhang, Liqiang Nie
Keywords: Noise, Annotations, Iterative methods, Scalability, Data models, Question answering (information retrieval), Foundation models, Art, Large language models, Refining, Data Quality, Scalable, Data Refinement, Video Content, Noise Distribution, Iterative Refinement, Noise Control, Theoretical Guarantees, Pre-training Dataset, Training Data, Implementation Details, Kullback-Leibler, Language Model, Evaluation Dataset, Real Distribution, Improve Data Quality, Refinement Method, Textual Elements, Foundation Model, Total Variation Distance, Real Data Distribution, Text Annotation, Iterative Stages, Video Encoding, Long-tailed Distribution, Batch Size Set, Dataset Section, Temporal Attention, Early Stage Of Training, Video-language pre-training, data-centric, video question answering, text-video retrieval
Abstracts:Recently, video-language understanding has achieved great success through large-scale pre-training. However, data scarcity remains a prevailing challenge. This study quantitatively reveals an “impossible trinity” among data quantity, diversity, and quality in pre-training datasets. Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations. These methods successfully refine the original annotations by leveraging useful information in multimodal video content (frames, tags, ASR transcripts, etc.). Nevertheless, they struggle to mitigate noise within synthetic annotations and lack scalability as the dataset size expands. To address these issues, we introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods. For iterative refinement, we first leverage a video-language model to generate synthetic annotations, resulting in a refined dataset. Then, we pre-train on it and fine-tune on human refinement examples for a stronger model. These processes are repeated for continuous improvement. For noise control, we present AdaTaiLr, a novel method that requires weaker assumptions on noise distribution. This method proves more effective in large datasets and offers theoretical guarantees. The combination of iterative refinement and AdaTaiLr can achieve better scalability in video-language understanding. Extensive experiments show that our framework outperforms existing data refinement baselines, delivering a 3% performance boost and improving dataset quality with minimal diversity loss. Furthermore, our refined dataset facilitates significant improvements in various video-language understanding tasks, including video question answering and text-video retrieval.
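
AdaTaiLr itself is not spelled out in the abstract, so as a stand-in illustration of a noise-robust objective for training on synthetic annotations, the sketch below implements the well-known generalized cross-entropy loss, which interpolates between cross-entropy and MAE and tolerates label noise better than plain cross-entropy. It is explicitly not the paper's method.

```python
# Generalized cross-entropy (GCE) loss as a generic noise-robust stand-in:
# q -> 0 recovers cross-entropy, q = 1 recovers MAE.
import torch

def generalized_cross_entropy(logits, target, q=0.7):
    """logits: (B, C); target: (B,) integer labels."""
    probs = torch.softmax(logits, dim=1)
    p_t = probs.gather(1, target.unsqueeze(1)).squeeze(1).clamp_min(1e-12)
    return ((1.0 - p_t.pow(q)) / q).mean()
```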
Adaptive Biased Stochastic Optimization
Zhuang Yang
Keywords: Stochastic processes, Optimization, Radio frequency, Convergence, Machine learning algorithms, Machine learning, Complexity theory, Numerical models, Adaptation models, Support vector machines, Stochastic Optimization, Machine Learning, Objective Function, Optimization Algorithm, Mild Conditions, Convergence Rate, Stochastic Gradient, Conjugate Gradient, Variance Reduction, Gradient Method, Stochastic Algorithm, RMSprop, Gradient-based Algorithm, Adaptive Optimization, AdaGrad, Strongly Convex, Adaptive Gradient, Linear Convergence Rate, Well-known Complex, Logistic Regression Model, Second-order Information, Case In Section, Learning Rate, Gradient Approximation, Adaptive Learning Rate, Stochastic Approximation, Stochastic Gradient Descent, Support Vector Machine, Case Of Theorem, Iterative Scheme, Stochastic optimization, biased gradient estimation, convergence analysis, numerical stability, adaptivity
Abstracts:This work develops and analyzes a class of adaptive biased stochastic optimization (ABSO) algorithms from the perspective of the GEneralized Adaptive gRadient (GEAR) method that contains Adam, AdaGrad, RMSProp, etc. Particularly, two preferred biased stochastic optimization (BSO) algorithms, the biased stochastic variance reduction gradient (BSVRG) algorithm and the stochastic recursive gradient algorithm (SARAH), equipped with GEAR, are first considered in this work, leading to two ABSO algorithms: BSVRG-GEAR and SARAH-GEAR. We present a uniform analysis of ABSO algorithms for minimizing strongly convex (SC) and Polyak-Łojasiewicz (PŁ) composite objective functions. Second, we also use our framework to develop another novel BSO algorithm, adaptive biased stochastic conjugate gradient (coined BSCG-GEAR), which achieves the well-known oracle complexity. Specifically, under mild conditions, we prove that the resulting ABSO algorithms attain a linear convergence rate on both PŁ and SC cases. Moreover, we show that the complexity of the resulting ABSO algorithms is comparable to that of advanced stochastic gradient-based algorithms. Finally, we demonstrate the empirical superiority and the numerical stability of the resulting ABSO algorithms by conducting numerical experiments on different applications of machine learning.
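
To make the ingredients concrete, the sketch below combines a SARAH-style biased recursive gradient estimator with an AdaGrad-like diagonal preconditioner. It is a generic illustration of the two components, not the paper's BSVRG-GEAR or SARAH-GEAR algorithms, and all hyperparameters are placeholders.

```python
# Minimal sketch: SARAH recursive gradient estimator stepped with an
# AdaGrad-like adaptive learning rate.
import numpy as np

def sarah_adaptive(grad_full, grad_i, w0, n, lr=0.1, epochs=5, inner=50, eps=1e-8):
    """grad_full(w): full gradient; grad_i(w, i): gradient of the i-th component;
    n: number of components."""
    w = w0.copy()
    accum = np.zeros_like(w)                          # running sum of squared estimates
    for _ in range(epochs):
        v = grad_full(w)                              # anchor: exact gradient
        w_prev = w.copy()
        for _ in range(inner):
            accum += v * v
            w = w - lr * v / (np.sqrt(accum) + eps)   # adaptive (AdaGrad-like) step
            i = np.random.randint(n)
            v = grad_i(w, i) - grad_i(w_prev, i) + v  # SARAH recursion (biased estimator)
            w_prev = w.copy()
    return w
```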
Privacy-Preserving Biometric Verification With Handwritten Random Digit String
Peirong Zhang, Yuliang Liu, Songxuan Lai, Hongliang Li, Lianwen Jin
Keywords: Writing, Forgery, Biometrics, Privacy, Biological system modeling, Data privacy, Training, Social networking (online), Blockchains, Authentication, Random Digit, Random String, Digit Strings, Biometric Verification, Personal Information, Privacy Protection, Malicious Attacks, Discriminative Patterns, Privacy Breaches, Pattern Mining, Training Data, Internet Of Things, Social Networking Sites, Diverse Content, Personal Identification Number, Handwritten Digits, Accuracy Verification, Privacy Preservation, Dynamic Time Warping, Signature Verification, Style Model, Key Segments, Verification System, Blockchain, Acquisition Session, Reliability Verification, Selection Pool, Fixed-length Vector, Identity Verification, Handwriting verification, privacy-preserving, biometrics, random digit string
Abstracts:Handwriting verification has stood as a steadfast identity authentication method for decades. However, this technique risks potential privacy breaches due to the inclusion of personal information in handwritten biometrics such as signatures. To address this concern, we propose using the Random Digit String (RDS) for privacy-preserving handwriting verification. This approach allows users to authenticate themselves by writing an arbitrary digit sequence, effectively ensuring privacy protection. To evaluate the effectiveness of RDS, we construct a new HRDS4BV dataset composed of online naturally handwritten RDS. Unlike conventional handwriting, RDS encompasses unconstrained and variable content, posing significant challenges for modeling consistent personal writing style. To surmount this, we propose the Pattern Attentive VErification Network (PAVENet), along with a Discriminative Pattern Mining (DPM) module. DPM adaptively enhances the recognition of consistent and discriminative writing patterns, thus refining handwriting style representation. Through comprehensive evaluations, we scrutinize the applicability of online RDS verification and showcase a pronounced outperformance of our model over existing methods. Furthermore, we discover a noteworthy forgery phenomenon that deviates from prior findings and discuss its positive impact in countering malicious impostor attacks. Substantially, our work underscores the feasibility of privacy-preserving biometric verification and propels the prospects of its broader acceptance and application.
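
Independent of the PAVENet architecture, the verification decision itself reduces to comparing fixed-length style embeddings. The sketch below shows a cosine-similarity threshold rule with a placeholder encoder, purely as an illustration; the threshold and encoder are our assumptions.

```python
# Minimal sketch of the accept/reject decision in embedding-based handwriting
# verification: compare an enrolled sample and a query sample in style space.
import numpy as np

def verify(encoder, enrolled_sample, query_sample, threshold=0.7):
    """`encoder` is any callable mapping a handwriting sequence to a 1-D vector."""
    e = np.asarray(encoder(enrolled_sample), dtype=float)
    q = np.asarray(encoder(query_sample), dtype=float)
    cos = float(np.dot(e, q) / (np.linalg.norm(e) * np.linalg.norm(q) + 1e-12))
    return cos >= threshold, cos
```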
UniMatch V2: Pushing the Limit of Semi-Supervised Semantic Segmentation
Lihe Yang, Zhen Zhao, Hengshuang Zhao
Keywords: Training, Semantic segmentation, Pipelines, Predictive models, Computer vision, Benchmark testing, Annotations, Visualization, Transformers, Streaming media, Semantic Segmentation, Semi-supervised Semantic Segmentation, Careful Design, Fewer Parameters, Training Costs, COCO Dataset, Unlabeled Images, Consistency Regularization, Learning Rate, Data Augmentation, Remote Sensing, Unlabeled Data, Confidence Threshold, Semi-supervised Learning, Challenging Dataset, Medical Image Analysis, Pseudo Labels, Latest Work, Foundation Model, Pre-trained Encoder, Single Stream, Color Distortion, Dual View, Semi-supervised Classification, Color Jittering, Framework Section, Semantic Segmentation Models, GPU Memory, Final Loss, Semi-supervised learning, semantic segmentation, weak-to-strong consistency, vision transformer
Abstracts:Semi-supervised semantic segmentation (SSS) aims at learning rich visual knowledge from cheap unlabeled images to enhance semantic segmentation capability. Among recent works, UniMatch (Yang et al. 2023) improves its precedents tremendously by amplifying the practice of weak-to-strong consistency regularization. Subsequent works typically follow similar pipelines and propose various delicate designs. Despite the achieved progress, strangely, even in this flourishing era of numerous powerful vision models, almost all SSS works are still sticking to 1) using outdated ResNet encoders with small-scale ImageNet-1K pre-training, and 2) evaluation on simple Pascal and Cityscapes datasets. In this work, we argue that it is necessary to switch the baseline of SSS from ResNet-based encoders to more capable ViT-based encoders (e.g., DINOv2) that are pre-trained on massive data. A simple update on the encoder (even using 2× fewer parameters) can bring more significant improvement than careful method designs. Built on this competitive baseline, we present our upgraded and simplified UniMatch V2, inheriting the core spirit of weak-to-strong consistency from V1, but requiring less training cost and providing consistently better results. Additionally, witnessing the gradually saturated performance on Pascal and Cityscapes, we appeal for a shift toward more challenging benchmarks with complex taxonomy, such as the ADE20K and COCO datasets.
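
Weak-to-strong consistency in the FixMatch/UniMatch spirit is simple to sketch: pseudo-labels from the weakly augmented view supervise the strongly augmented view, gated by a confidence threshold. The code below is a generic illustration with assumed shapes and threshold, not the UniMatch V2 implementation.

```python
# Minimal sketch: weak-to-strong consistency loss for semantic segmentation.
import torch
import torch.nn.functional as F

def weak_to_strong_loss(model, img_weak, img_strong, conf_thresh=0.95):
    with torch.no_grad():
        probs = torch.softmax(model(img_weak), dim=1)   # (B, C, H, W)
        conf, pseudo = probs.max(dim=1)                 # per-pixel confidence / label
    logits_strong = model(img_strong)
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")  # (B, H, W)
    mask = (conf >= conf_thresh).float()                # keep confident pixels only
    return (loss * mask).sum() / mask.sum().clamp_min(1.0)
```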