-
Generative Causality-Driven Network for Graph Multi-Task Learning
Xixun Lin, Qing Yu, Yanan Cao, Lixin Zou, Chuan Zhou, Jia Wu, Chenliang Li, Peng Zhang, Shirui Pan
Keywords: Multitasking, Training, Mathematical models, Graph neural networks, Data models, Generators, Predictive models, Optimization, Numerical analysis, Correlation, Multi-task Learning, Neural Network, Feature Space, Real-world Datasets, Graph Neural Networks, Causal Structure, Theoretical Derivation, Output Space, Causal Framework, Machine Learning Paradigm, Energy-based Model, True Edges, Multi-task Learning Method, Uniform Distribution, Structural Equation Modeling, Learning Task, Time Complexity, Energy Function, Multilayer Perceptron, Structure Of Space, Directed Acyclic Graph, Noise Distribution, Causal Graph, Graph Convolutional Network, Labeling Task, Node Features, Stochastic Block Model, Distribution Of Tasks, Label Distribution, Updated Model, Graph machine learning, graph multi-task learning, graph neural networks
Abstract: Multi-task learning (MTL) is a standard learning paradigm in machine learning. The central idea of MTL is to capture the shared knowledge among multiple tasks to mitigate the problem of data sparsity, where the annotated samples for each task are quite limited. Recent studies indicate that graph multi-task learning (GMTL) yields promising improvements over previous MTL methods. GMTL represents tasks on a task relation graph and further leverages graph neural networks (GNNs) to learn complex task relationships. Although GMTL achieves better performance, the construction of the task relation graph depends heavily on simple heuristic tricks, which results in spurious task correlations and the absence of true edges between strongly connected tasks. This problem largely limits the effectiveness of GMTL. To this end, we propose the Generative Causality-driven Network (GCNet), a novel framework that progressively learns the causal structure between tasks to discover which tasks benefit from being jointly trained, improving generalization ability and model robustness. Specifically, in the feature space, GCNet first introduces a feature-level generator that generates the structure prior, reducing learning difficulty. Afterwards, GCNet develops an output-level generator, parameterized as a new causal energy-based model (EBM), to refine the learned structure prior in the output space, driven by causality. Benefiting from our proposed causal framework, we theoretically derive an intervention contrastive estimation for training this causal EBM efficiently. Experiments are conducted on multiple synthetic and real-world datasets. Extensive empirical results and model analyses demonstrate the superior performance of GCNet over several competitive MTL baselines.
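The intervention contrastive estimation itself is only named in the abstract. As a rough, minimal sketch of the general recipe it builds on (contrastive training of an energy-based model over candidate task-relation edges), here is a PyTorch toy; `EdgeEnergy`, `contrastive_ebm_loss`, and the randomly resampled negatives are hypothetical stand-ins, not GCNet's actual estimator.

```python
import torch
import torch.nn as nn

class EdgeEnergy(nn.Module):
    """Hypothetical energy function over a candidate task-relation edge.

    Scores a pair of task embeddings; lower energy means a more plausible
    edge. The MLP here is illustrative, not GCNet's actual causal EBM.
    """
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, src, dst):
        return self.net(torch.cat([src, dst], dim=-1)).squeeze(-1)

def contrastive_ebm_loss(energy, src, dst, noise_dst):
    """Logistic contrastive objective: pull energy down on edges from the
    learned structure prior, push it up on edges whose endpoint is resampled
    from a noise distribution (a crude stand-in for intervention negatives)."""
    e_pos = energy(src, dst)        # energies of prior-structure edges
    e_neg = energy(src, noise_dst)  # energies of perturbed edges
    return (torch.nn.functional.softplus(e_pos).mean()
            + torch.nn.functional.softplus(-e_neg).mean())

# Toy usage: 8 tasks with 16-d embeddings, negatives drawn uniformly at random.
tasks = torch.randn(8, 16)
energy = EdgeEnergy(16)
idx = lambda: torch.randint(0, 8, (32,))
loss = contrastive_ebm_loss(energy, tasks[idx()], tasks[idx()], tasks[idx()])
loss.backward()
```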
-
SS-NeRF: Physically Based Sparse Spectral Rendering With Neural Radiance Field
Ru Li, Jia Liu, Guanghui Liu, Shengping Zhang, Bing Zeng, Shuaicheng Liu
Keywords: Rendering (computer graphics), Neural radiance field, Image reconstruction, Image color analysis, Three-dimensional displays, Geometry, Training, Pipelines, Data mining, Cameras, Radiance Field, Neural Radiance Fields, Multilayer Perceptron, RGB Images, Spectral Radiance, Scene Reconstruction, Spectrum Mapping, Sparse Input, Least-squares, 3D Space, Depth Map, Low-level Features, Spectral Imaging, Row Of Fig, Spectral Distribution, Reconstruction Loss, Mixed Materials, Sparse Method, 3D Scene, View Synthesis, Spectral Power Distribution, Real-world Scenes, Spectral Dataset, Geometric Consistency, Attention Gate, Spatial Continuity, Simple Linear Iterative Clustering, Visible Light, Depth Values, Spectral rendering, neural radiance field, sparse scene reconstruction
Abstract: In this paper, we propose SS-NeRF, an end-to-end Neural Radiance Field (NeRF)-based architecture for high-quality physically based rendering with sparse inputs. We decompose classical spectral rendering into two main steps: 1) the generation of a series of spectrum maps spanning different wavelengths, and 2) the combination of these spectrum maps into the RGB output. The proposed architecture implements these two steps through a multi-layer perceptron (MLP)-based architecture (SpectralMLP) and a spectrum attention UNet (SAUNet). Given the ray origin and the ray direction, the SpectralMLP constructs the spectral radiance field to obtain spectrum maps of novel views, which are then sent to the SAUNet to produce RGB images under white-light illumination. Building spectral rendering on NeRF is more physically grounded from a ray-tracing perspective. Further, the spectral radiance fields decompose difficult scenes and improve the performance of NeRF-based methods. Previous baselines such as SpectralNeRF outperform recent methods in synthesizing novel views but require relatively dense viewpoints for accurate scene reconstruction. To tackle this, we propose SS-NeRF to enhance the detail of the scene representation with sparse inputs. In SS-NeRF, we first design a depth-aware continuity to optimize the reconstruction based on single-view depth predictions. Then, a geometric-projected consistency is introduced to optimize the multi-view geometry alignment. Additionally, we introduce a superpixel-aligned consistency to ensure that the average color within each superpixel region remains consistent. Comprehensive experimental results demonstrate that the proposed method is superior to recent state-of-the-art methods when synthesizing novel views on both synthetic and real-world datasets.
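To make step 2 concrete, below is a minimal NumPy sketch that combines per-wavelength spectrum maps into an RGB image with a fixed linear spectral response. In SS-NeRF this combination is learned by SAUNet, so the fixed `response` matrix and the shapes here are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def combine_spectrum_maps(spectrum_maps, response):
    """Combine per-wavelength spectrum maps into an RGB image.

    spectrum_maps: (H, W, S) array, one radiance map per sampled wavelength.
    response: (S, 3) spectral-to-RGB weights (e.g., CIE-like response curves);
    a fixed linear response is only a physically motivated approximation of
    what SAUNet learns.
    """
    rgb = np.tensordot(spectrum_maps, response, axes=([2], [0]))  # (H, W, 3)
    return np.clip(rgb, 0.0, 1.0)

# Toy usage: 10 wavelength samples across the visible range.
maps = np.random.rand(64, 64, 10)          # stand-in for SpectralMLP outputs
resp = np.random.rand(10, 3) / 10.0        # stand-in response curves
image = combine_spectrum_maps(maps, resp)  # white-light RGB rendering
```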
-
Efficient Nearest Neighbor Search Using Dynamic Programming
Pengfei Wang, Jiantao Song, Shiqing Xin, Shuangmin Chen, Changhe Tu, Wenping Wang, Jiaye Wang
Keywords: Point cloud compression, Nearest neighbor methods, Generators, Three-dimensional displays, Octrees, Manifolds, Dynamic programming, Distributed databases, Computer graphics, Clustering algorithms, Dynamic Programming, Nearest Neighbor Search, Search Algorithm, Point Cloud, Construction Process, Closest Point, Voronoi Diagram, Spatial Partitioning, Incremental Process, Iterative Closest Point, Key Applications, Query Point, Query Performance, Time Complexity, Number Of Comparisons, Bounding Box, Number Of Objects, Line Segment, Leaf Node, Nearest Point, Delaunay Triangulation, Iterative Closest Point Algorithm, Query List, Preprocessing Time, Query Efficiency, Minimum Bounding Box, Tree Segmentation, Nearest Distance, Worst-case Complexity, Query Length, Nearest neighbor search, delaunay triangulation, voronoi diagram, farthest point sampling, density peak clustering
Abstract: Given a collection of points in $\mathbb{R}^{3}$, KD-Tree and R-Tree are well-known nearest neighbor search (NNS) algorithms that rely on spatial partitioning and indexing techniques. However, when the query point is far from the data points or the data points inherently represent a 2-manifold surface, their query performance may degrade. To address this, we propose a novel dynamic programming technique that precomputes a Directed Acyclic Graph (DAG) to encode the proximity structure between data points. More specifically, the DAG captures how the proximity structure evolves during the incremental construction of the Voronoi diagram of the data points. Experimental results demonstrate that our method achieves a speed increase of 1-10x. Furthermore, our algorithm demonstrates significant practical value in diverse applications. We validated its effectiveness through extensive testing in four key applications: Point-to-Mesh Distance Queries, Iterative Closest Point (ICP) Registration, Density Peak Clustering, and Point-to-Segments Distance Queries. A particularly notable feature of our approach is its unique ability to efficiently identify the nearest neighbor among the first $k$ points in the point cloud, a capability that enables substantial acceleration in low-dimensional applications like Density Peak Clustering. As a natural extension of our incremental construction process, our method can also be readily adapted for farthest-point sampling tasks. These experimental results across multiple domains underscore the broad applicability and practical importance of our approach.
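For reference, the farthest-point sampling task mentioned above is conventionally solved with the naive O(nk) loop below; the paper's DAG-based construction is designed to accelerate this kind of incremental proximity computation, and the sketch is only the brute-force baseline, not the proposed algorithm.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Naive O(n*k) farthest-point sampling baseline.

    Each round picks the point farthest from the current sample set, then
    updates every point's distance to its nearest chosen sample.
    """
    chosen = [0]                                   # seed with an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                 # farthest from the sample set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]

cloud = np.random.rand(10_000, 3)
samples = farthest_point_sampling(cloud, 64)
```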
-
PlaneRecTR++: Unified Query Learning for Joint 3D Planar Reconstruction and Pose Estimation
Jingjia Shi, Shuaifeng Zhi, Kai Xu
Keywords: Three-dimensional displays, Cameras, Image reconstruction, Pose estimation, Transformers, Image segmentation, Training, Pipelines, Optimization, Geometry, Pose Estimation, Public Datasets, Distinct Modes, Divide-and-conquer, Camera Pose, Relative Pose, Depth Prediction, Pose Prediction, Camera Pose Estimation, Convolutional Neural Network, Superior Performance, Input Image, Feature Maps, 3D Reconstruction, Line Segment, Public Key, Key Design, Planar Geometry, Planar Regions, Rotation Error, Parameter Plane, Planar Graphs, 3D Plane, Attention Matrix, Depth Planes, Bipartite Matching, Transformation Module, Positional Encoding, Sparse Reconstruction, Direct Supervision, Planar reconstruction, query learning, relative pose estimation, sparse views reconstruction
Abstract: The challenging task of 3D planar reconstruction from images involves several sub-tasks, including frame-wise plane detection, segmentation, parameter regression and possibly depth prediction, along with cross-frame plane correspondence and relative camera pose estimation. Previous works adopt a divide-and-conquer strategy, addressing the above sub-tasks with distinct network modules in a two-stage paradigm. Specifically, given an initial camera pose and per-frame plane predictions from the first stage, purpose-built modules relying on external plane correspondence labels are applied to merge multi-view plane entities and produce a refined camera pose. Notably, existing work fails to integrate these closely related sub-tasks into a unified framework, and instead addresses them separately and sequentially, which we identify as a primary source of performance limitations. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper we propose PlaneRecTR++, a Transformer-based architecture which, for the first time, unifies all tasks of multi-view planar reconstruction and pose estimation within a compact single-stage framework, eliminating the need for initial pose estimation and plane correspondence supervision. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across sub-tasks, attaining a new state-of-the-art performance on the public ScanNetv1, ScanNetv2, NYUv2-Plane, and MatterPort3D datasets.
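Query-based set prediction of the kind described above is typically trained by bipartite matching of predicted queries against ground-truth entities (the keywords also list Bipartite Matching). A minimal SciPy sketch follows, using an L1 cost over plane parameters as a simplifying assumption; the actual matching cost would also include segmentation and classification terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_plane_queries(pred_params, gt_params):
    """Hungarian (bipartite) matching between predicted plane queries and
    ground-truth planes, the standard set-prediction recipe behind
    query-based detectors. Cost here is plain L1 distance between plane
    parameter vectors, a deliberate simplification."""
    cost = np.abs(pred_params[:, None, :] - gt_params[None, :, :]).sum(-1)
    rows, cols = linear_sum_assignment(cost)   # optimal one-to-one assignment
    return list(zip(rows.tolist(), cols.tolist()))

# Toy usage: 5 plane queries matched against 3 ground-truth planes (n, d).
pairs = match_plane_queries(np.random.rand(5, 4), np.random.rand(3, 4))
```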
-
3D Hand Pose Estimation via Articulated Anchor-to-Joint 3D Local Regressors
Changlong Jiang, Yang Xiao, Jinghong Zheng, Haohong Kuang, Cunlin Wu, Mingyang Zhang, Zhiguo Cao, Min Du, Joey Tianyi Zhou, Junsong Yuan
Keywords: Hands, Three-dimensional displays, Joints, Pose estimation, Transformers, Accuracy, Solid modeling, Location awareness, Artificial intelligence, Representation learning, Pose Estimation, Human Pose Estimation, 3D Pose, Hand Pose, Hand Pose Estimation, 3D Hand Pose Estimation, Local Regressors, Generalization Ability, 3D Space, Global Context, RGB Images, Depth Images, Local Details, Hand Joints, Single RGB, 3D Tasks, Single Hand, Localization Accuracy, Multi-scale Features, 3D Joint, Target Joint, Convolutional Features, Strong Generalization Ability, Model-free Approach, Transformer Encoder, Feature Enhancement, Promising Performance, 3D hand pose estimation, anchor-to-joint regression, anchor articulation, Transformer, joint-aware anchor setting
Abstract: In this paper, we propose to address monocular 3D hand pose estimation from a single RGB or depth image via articulated anchor-to-joint 3D local regressors, in the form of A2J-Transformer+. The key idea is to make the local regressors (i.e., anchor points) in 3D space jointly aware of the hand's local fine details and global articulated context, to facilitate predicting their 3D offsets toward hand joints, with linear weighted aggregation for joint localization. Our intuition is that local fine details help to estimate accurate offsets but may suffer from issues including serious occlusion, confusingly similar patterns, and overfitting risk. On the other hand, the hand's global articulated context can provide additional descriptive clues and constraints to alleviate these issues. To set anchor points adaptively in 3D space, A2J-Transformer+ runs in a two-stage manner. At the first stage, owing to the input modality's properties, anchor points are distributed more densely on the X-Y plane, which leads to lower prediction accuracy along the Z direction than along the X and Y directions. To alleviate this, at the second stage anchor points are set evenly along the X, Y, and Z directions near the joints yielded by the first stage. This treatment brings two main advantages: (1) balancing the prediction accuracy along the X, Y, and Z directions, and (2) ensuring the anchor-joint offsets are small values that are relatively easy to estimate. Wide-ranging experiments on three RGB hand datasets (InterHand2.6M, HO-3D V2 and RHP) and three depth hand datasets (NYU, ICVL and HANDS 2017) verify A2J-Transformer+'s superiority and generalization ability across modalities (i.e., RGB and depth) and hand cases (i.e., single hand, interacting hands, and hand-object interaction), even outperforming model-based approaches. A test on the ITOP dataset shows that A2J-Transformer+ can also be applied to the 3D human pose estimation task.
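The linear weighted aggregation of anchor predictions described above can be sketched in a few lines of NumPy. The softmax weighting, shapes, and the name `aggregate_anchor_votes` are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def aggregate_anchor_votes(anchors, offsets, logits):
    """Anchor-to-joint aggregation: each 3D anchor casts a vote
    (anchor position + predicted offset) for every joint, and the joint
    location is the softmax-weighted average of those votes.

    anchors: (A, 3) positions, offsets: (A, J, 3), logits: (A, J).
    """
    weights = np.exp(logits - logits.max(0))          # per-joint softmax over anchors
    weights = weights / weights.sum(0, keepdims=True)
    votes = anchors[:, None, :] + offsets             # (A, J, 3) per-anchor estimates
    return (weights[..., None] * votes).sum(0)        # (J, 3) joint locations

# Toy usage: 128 anchors voting for 21 hand joints.
joints = aggregate_anchor_votes(np.random.rand(128, 3),
                                np.random.randn(128, 21, 3) * 0.01,
                                np.random.randn(128, 21))
```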
-
LVOS: A Benchmark for Large-Scale Long-Term Video Object Segmentation
Lingyi Hong, Zhongying Liu, Wenchao Chen, Chenzhi Tan, Yuang Feng, Xinyu Zhou, Pinxue Guo, Jinglun Li, Zhaoyu Chen, Shuyong Gao, Wei Zhang, Wenqiang Zhang
Keywords: Videos, Object segmentation, Benchmark testing, Annotations, Training, Analytical models, Accuracy, Pipelines, Pattern analysis, Complexity theory, Object Segmentation, Large-scale Object, Video Object Segmentation, Long-term Video, Long-term Object, Real-world Scenarios, Target Object, Similar Objects, Real Scenes, Long-term Datasets, Video Length, High-quality Annotations, Video Object, Model Performance, Training Set, Validation Set, Visual Features, Large-scale Datasets, Bounding Box, Linguistic Features, Object In Frame, Long-term Task, Memory Bank, Precise Annotation, Complex Motion, Video Duration, Interactive Segmentation, Error Accumulation, Large-scale Variation, Entire Video, Video object segmentation, large-scale benchmark, long-term video understanding, dataset
Abstract: Video object segmentation (VOS) aims to distinguish and track target objects in a video. Despite the excellent performance achieved by off-the-shelf VOS models, existing VOS benchmarks mainly focus on short-term videos, where objects remain visible most of the time. These benchmarks may not fully capture the challenges encountered in practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. Thus, we propose a novel benchmark named LVOS, comprising 720 videos with 296,401 frames and 407,945 high-quality annotations. Videos in LVOS last 1.14 minutes on average. Each video includes various attributes, especially challenges encountered in the wild, such as long-term reappearance and cross-temporally similar objects. Compared to previous benchmarks, LVOS better reflects VOS models' performance in real scenarios. Based on LVOS, we evaluate 15 existing VOS models under 3 different settings and conduct a comprehensive analysis. On LVOS, these models suffer a large performance drop, highlighting the challenge of achieving precise tracking and segmentation in real-world scenarios. Attribute-based analysis indicates that a significant factor contributing to the accuracy decline is increased video length, interacting with complex challenges such as long-term reappearance, cross-temporal confusion, and occlusion, which underscores LVOS's crucial role. We hope LVOS can advance the development of VOS in real scenes.
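VOS benchmarks of this kind are conventionally scored with the J&F measure. As a minimal illustration, the region-similarity half (the Jaccard index J) can be computed as below; contour accuracy F and LVOS's exact evaluation protocol are omitted.

```python
import numpy as np

def jaccard(pred_mask, gt_mask):
    """Region similarity J (intersection over union) between a predicted and
    a ground-truth segmentation mask, one half of the standard J&F score."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both empty: object absent and correctly not predicted
        return 1.0
    return np.logical_and(pred, gt).sum() / union

# Toy usage on random 480 x 854 masks.
score = jaccard(np.random.rand(480, 854) > 0.5, np.random.rand(480, 854) > 0.5)
```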
-
ACLI: A CNN Pruning Framework Leveraging Adjacent Convolutional Layer Interdependence and $\gamma$-Weakly Submodularity
Sadegh Tofigh, Mohammad Askarizadeh, M. Omair Ahmad, M.N.S. Swamy, Kim Khoa Nguyen
Keywords: Filters, Complexity theory, Accuracy, Upper bound, Convolutional neural networks, Greedy algorithms, Correlation, Biological system modeling, Training, Data mining, Convolutional Neural Network, Convolutional Layers, Adjacent Layers, Reduction In Parameters, Pruning Method, Pruning Techniques, Optimization Problem, Upper Bound, Computational Complexity, Output Layer, Input Image, Low Complexity, ImageNet, Resource Efficiency, Row Of Table, NP-hard Problem, Selectivity Filter, Floating-point Operations, Filters In Layer, Channel Layer, Filters In Each Convolutional Layer, Output Error, Average Absolute Error, Top-1 Accuracy, Pruning Process, Parameter Count, Number Of Filters, Normal Group, Average Error, Selectivity Index, Deep learning, pruning, data-free, machine learning, convolutional neural networks, model comparison
Abstract: Today, convolutional neural network (CNN) pruning techniques often rely on manually crafted importance criteria and pruning structures. Due to their heuristic nature, these methods may lack generality, and their performance is not guaranteed. In this paper, we propose a theoretical framework to address this challenge by leveraging the concept of $\gamma$-weak submodularity, based on a new efficient importance function. By deriving an upper bound on the absolute error in the layer subsequent to the pruned layer, we formulate the importance function as a $\gamma$-weakly submodular function. This formulation enables the development of an easy-to-implement, low-complexity, and data-free oblivious algorithm for selecting the filters to be removed from a convolutional layer. Extensive experiments show that our method outperforms state-of-the-art baselines on benchmark networks across various datasets, with a computational cost comparable to the simplest pruning techniques, such as $l_{2}$-norm pruning. Notably, the proposed method achieves an accuracy of 76.52%, compared to 75.15% for the overall best baseline, with a 25.5% reduction in network parameters. According to our proposed resource-efficiency metric for pruning methods, the ACLI approach demonstrates orders-of-magnitude higher efficiency than the other baselines while maintaining competitive accuracy.
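The generic template behind (weakly) submodular pruning is a greedy loop that repeatedly keeps the filter with the largest marginal gain of an importance function, which is what the approximation guarantees attach to. The sketch below uses a residual-norm gain as an illustrative stand-in for ACLI's importance function, which is instead derived from the error bound on the subsequent layer.

```python
import numpy as np

def greedy_filter_selection(weights, keep):
    """Greedy maximization of a set importance function over filters.

    Gain used here: norm of each filter's residual after projecting onto the
    span of already-kept filters (a diversity-style stand-in, not ACLI's
    actual criterion). weights: (F, D) flattened filters of one conv layer.
    Returns indices of filters to keep; the rest are pruned.
    """
    kept, basis = [], np.zeros((0, weights.shape[1]))
    for _ in range(keep):
        if basis.shape[0] > 0:
            proj = weights @ basis.T @ basis          # projection onto kept span
            gain = np.linalg.norm(weights - proj, axis=1)
        else:
            gain = np.linalg.norm(weights, axis=1)
        gain[kept] = -np.inf                          # never re-pick a kept filter
        best = int(np.argmax(gain))                   # largest marginal gain
        kept.append(best)
        v = weights[best] - (basis.T @ (basis @ weights[best]) if len(kept) > 1 else 0)
        basis = np.vstack([basis, v / (np.linalg.norm(v) + 1e-12)])
    return sorted(kept)

# Toy usage: keep 32 of 64 filters with 3x3x3 kernels (27 weights each).
pruned_away = set(range(64)) - set(greedy_filter_selection(np.random.randn(64, 27), 32))
```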
-
Toward Optimal Mixture of Experts System for 3D Object Detection: A Game of Accuracy, Efficiency and Adaptivity
Linshen Liu, Pu Wang, Guanlin Wu, Junyue Jiang, Hao Frank Yang
Keywords: Accuracy, Three-dimensional displays, Object detection, Laser radar, Cameras, Training, Real-time systems, Optimization, Image edge detection, Feature extraction, Object Detection, 3D Detection, 3D Object Detection, Autonomous Vehicles, Computational Overhead, Image Features, Pedestrian, Model Size, Point Cloud, Confidence Score, Innovation System, Low Latency, Object Distance, Hardware Platform, Hierarchical Strategy, Edge Devices, Difficulty Of Detection, Multimodal Features, Multimodal Methods, Execution Efficiency, KITTI Dataset, Balance Efficiency, Parameter Count, Hardware Constraints, 3D LiDAR, Final Detection, 3D Space, Times Speedup, Inference Time, Hyperparameters, Mixture of expert (MoE), computing system, efficiency, 3D object detection, edge computing
Abstract: Autonomous vehicles, open-world robots, and other automated systems rely on accurate, efficient perception modules for real-time object detection. Although high-precision models improve reliability, their processing time and computational overhead can hinder real-time performance and raise safety concerns. This paper introduces an Edge-based Mixture-of-Experts Optimal Sensing (EMOS) System that addresses the challenge of jointly achieving accuracy, latency, and scene adaptivity, demonstrated in open-world autonomous driving scenarios. Algorithmically, EMOS fuses multimodal sensor streams via an Adaptive Multimodal Data Bridge and uses a scenario-aware MoE switch to activate only a complementary set of specialized experts as needed. The proposed hierarchical backpropagation and a multiscale pooling layer let model capacity scale with real-world demand complexity. System-wise, an edge-optimized runtime with accelerator-aware scheduling (e.g., ONNX/TensorRT), zero-copy buffering, and overlapped I/O-compute enforces explicit latency/accuracy budgets across diverse driving conditions. Experimental results establish EMOS as the new state of the art: on KITTI, it increases average AP by 3.17% while running $2.6\times$ faster on an Nvidia Jetson. On nuScenes, it improves accuracy by 0.2% mAP and 0.5% NDS, with 34% fewer parameters and a $15.35\times$ Nvidia Jetson speedup. Leveraging multimodal data and intelligent expert cooperation, EMOS delivers an accurate, efficient, and edge-adaptive perception system for autonomous vehicles, ensuring robust, timely responses in real-world scenarios.
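The scenario-aware MoE switch can be illustrated with a generic top-k gating router in PyTorch. The expert bodies, feature size, and `k` below are placeholders; EMOS's actual experts, gating inputs, and scheduling are not described at this level of detail in the abstract.

```python
import torch
import torch.nn as nn

class ScenarioAwareSwitch(nn.Module):
    """Minimal top-k mixture-of-experts switch: a gate scores experts from a
    scene descriptor and only the k best-scoring experts run, which is the
    general mechanism behind activating a complementary subset of experts."""
    def __init__(self, dim, n_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.k = k

    def forward(self, x):
        scores = self.gate(x)                          # (B, n_experts)
        topv, topi = scores.topk(self.k, dim=-1)
        weights = torch.softmax(topv, dim=-1)          # renormalize over active experts
        out = torch.zeros_like(x)
        for j in range(self.k):                        # run only the selected experts
            for e in topi[:, j].unique():
                sel = topi[:, j] == e
                out[sel] += weights[sel, j, None] * self.experts[int(e)](x[sel])
        return out

# Toy usage: route a batch of 8 scene descriptors through 2 of 4 experts each.
fused = ScenarioAwareSwitch(dim=32)(torch.randn(8, 32))
```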
-
Pathway-Aware Multimodal Transformer (PAMT): Integrating Pathological Image and Gene Expression for Interpretable Cancer Survival Analysis
Rui Yan, Xueyuan Zhang, Zihang Jiang, Baizhi Wang, Xiuwu Bian, Fei Ren, S. Kevin Zhou
Keywords: Pathology, Cancer, Gene expression, Transformers, Feature extraction, Data models, Biological system modeling, Analytical models, Deep learning, Semantics, Survival Analysis, Cancer Survivors, Pathological Images, Multimodal Transformer, Pathological Gene Expression, Cancer Survival Analysis, Gene Expression Data, Biological Pathways, Lung Adenocarcinoma, Medical Knowledge, Urothelial Carcinoma, Lung Squamous Cell Carcinoma, Transformer Model, Digital Pathology, Contrastive Loss, Multimodal Learning, Semantic Space, Multimodal Methods, Good Interpretability, Fusion Stage, Self-supervised Learning, Transformer Encoder, Vision Transformer, Patch Selection, Predictive Performance, Deep Learning, Graph Neural Networks, Improve Prediction Performance, Receiver Operating Characteristic Curve, Transformer-based Methods, Multimodal transformer, model interpretability, survival analysis, pathological image analysis, gene expression, Humans, Neoplasms, Survival Analysis, Algorithms, Gene Expression Profiling, Image Interpretation (Computer-Assisted), Machine Learning
Abstract: Integrating multimodal data of pathological images and gene expression for cancer survival analysis can achieve better results than using a single modality. However, existing multimodal learning methods ignore fine-grained interactions between the two modalities, especially the interactions between biological pathways and pathological image patches. In this article, we propose a novel Pathway-Aware Multimodal Transformer (PAMT) framework for interpretable cancer survival analysis. Specifically, PAMT learns fine-grained modality interaction through three stages: (1) In the intra-modal pathway-pathway / patch-patch interaction stage, we use the Transformer model to perform intra-modal information interaction; (2) In the inter-modal pathway-patch alignment stage, we introduce a novel label-free contrastive loss to align semantic information between the modalities so that the features of the two modalities are mapped to the same semantic space; and (3) In the inter-modal pathway-patch fusion stage, to model the medical prior knowledge that "genotype determines phenotype", we propose a pathway-to-patch cross fusion module to perform inter-modal information interaction under the guidance of the pathway prior. In addition, the inter-modal cross fusion module endows PAMT with good interpretability, helping a pathologist screen which pathways play a key role, locate which regions of the whole slide image (WSI) are affected by a pathway, and mine prognosis-relevant pathology image patterns. Experimental results on three datasets of bladder urothelial carcinoma, lung squamous cell carcinoma, and lung adenocarcinoma demonstrate that the proposed framework significantly outperforms state-of-the-art methods.
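The pathway-to-patch cross fusion can be pictured as cross-attention in which pathway tokens act as queries and patch tokens as keys and values. The sketch below uses a stock `nn.MultiheadAttention` layer under assumed token counts and dimensions; it is not PAMT's exact module, but the attention map it returns shows where the claimed interpretability comes from.

```python
import torch
import torch.nn as nn

# Pathway-to-patch cross fusion, sketched with a stock multi-head attention
# layer: each biological pathway token attends to the WSI patch tokens it
# putatively drives ("genotype determines phenotype"). The single-layer
# design and all dimensions are assumptions.
cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

pathways = torch.randn(1, 50, 256)    # 50 pathway tokens from gene expression
patches = torch.randn(1, 1000, 256)   # 1000 patch tokens from the WSI
fused, attn = cross_attn(query=pathways, key=patches, value=patches)

# `attn` has shape (1, 50, 1000): per pathway, a weight over WSI patches.
# High weights indicate image regions associated with that pathway, which is
# the kind of map a pathologist could inspect for interpretation.
```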
-
Defenses in Adversarial Machine Learning: A Systematic Survey From the Lifecycle Perspective
Baoyuan Wu, Mingli Zhu, Meixi Zheng, Zihao Zhu, Shaokui Wei, Mingda Zhang, Hongrui Chen, Danni Yuan, Li Liu, Qingshan Liu
Keywords: Training, Data models, Automated machine learning, Robustness, Taxonomy, Surveys, Predictive models, Adversarial machine learning, Perturbation methods, Training data, Machine Learning, Generative Adversarial Networks, Life Cycle Perspective, Neural Network, Defense Mechanisms, Deep Neural Network, Unique Perspective, Machine Learning Systems, Adversarial Examples, Defense Methods, Detection Methods, Gaussian Noise, Training Stage, Target Class, Robust Predictor, Updated Model, Clean Samples, Self-supervised Learning, Adversarial Training, Federated Learning, Neural Architecture Search, Adversarial Perturbations, Adversarial Attacks, Robust Architecture, Adversarial Robustness, Pre-training Stage, Benign Samples, Clear Model, Samples In The Feature Space, Self-supervised Task, Adversarial machine learning, backdoor defense, weight defense, adversarial example defense
Abstract: Adversarial phenomena have been widely observed in machine learning (ML) systems, especially those using deep neural networks. These phenomena describe situations where ML systems produce predictions that are inconsistent and incomprehensible to humans in certain cases. Such behavior poses a serious security threat to the practical application of ML systems. To exploit this vulnerability, several advanced attack paradigms have been developed, mainly including backdoor attacks, weight attacks, and adversarial examples. For each individual attack paradigm, various defense mechanisms have been proposed to enhance the robustness of models against the corresponding attacks. However, due to the independence and diversity of these defense paradigms, it is challenging to assess the overall robustness of an ML system against different attack paradigms. This survey provides a systematic review of existing defense paradigms from a unified lifecycle perspective. Specifically, we decompose a complete ML system into five stages: pre-training, training, post-training, deployment, and inference. We then present a clear taxonomy to categorize representative defense methods at each stage. This unified perspective and taxonomy not only help us analyze defense mechanisms but also enable us to understand the connections and differences among defense paradigms, inspiring future research toward more advanced and comprehensive defense strategies.